Diffusers / Stable Diffusion web UI / Diffusion Models
Stable Diffusion web UI setup
$ sudo apt install --no-install-recommends google-perftools
$ sudo apt install bc
$ git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
$ cd stable-diffusion-webui
$ bash ./webui.sh
################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye), Fedora 34+ and openSUSE Leap 15.4 or newer.
################################################################
...
Downloading: "https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.safetensors" to /home/xxx/yyy/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
85%|██████████████████████████████████████████████████████████████████ | 3.36G/3.97G [00:53<00:15, 42.0MB/s]
Launch
Access localhost:7860 in a browser.
Data size
$ du -h
...
4.0G ./models/Stable-diffusion
...
4.3G .
Diffusers
Model files
They live under ~/.cache/huggingface/hub/.
$ ls -1 ~/.cache/huggingface/hub/
models--openai--clip-vit-large-patch14
models--runwayml--stable-diffusion-v1-5
models--stabilityai--stable-diffusion-xl-base-1.0
version.txt
version_diffusers_cache.txt
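Incidentally, huggingface_hub ships a subcommand that summarizes this cache, which may be handier than ls:
$ huggingface-cli scan-cache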
from diffusers import AutoPipelineForText2Image
import torch
pipeline = AutoPipelineForText2Image.from_pretrained(
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
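For reference, a minimal generation call with this pipeline would look like the following (the prompt is just a placeholder):
image = pipeline("a photo of an astronaut riding a horse").images[0]
image.save("sd15_sample.png")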
$ pip install 'huggingface_hub[cli,torch]'
$ huggingface-cli login
Log in first, then download with the following. The model is around 15 GB. c.f. stabilityai/stable-diffusion-3-medium-diffusers
import torch
from diffusers import DiffusionPipeline

model_id = "stabilityai/stable-diffusion-3-medium-diffusers"
pipeline = DiffusionPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
)
pipeline.to("cuda")
data did not match any variant of untagged enum PyPreTokenizerTypeWrapper
If this appears, update transformers (>4.37.2). c.f. SD3 - data did not match any variant of untagged enum PyPreTokenizerTypeWrapper, data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 960 column 3
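For example:
$ pip install -U 'transformers>4.37.2'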
If
OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB. GPU
comes up, the following variant (dropping the T5 text encoder) helps:
model_id = "stabilityai/stable-diffusion-3-medium-diffusers"
pipeline = DiffusionPipeline.from_pretrained(
model_id,
text_encoder_3=None,
tokenizer_3=None,
torch_dtype=torch.float16,
)
pipeline.to("cuda")
With this it fits:
6618MiB / 16376MiB
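For reference, a generation call with this SD3 pipeline might look like the following (prompt, step count, and guidance are arbitrary examples):
result = pipeline(
    prompt="a cat in a spacesuit",
    num_inference_steps=28,
    guidance_scale=7.0,
)
result.images[0].save("sd3_sample.png")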
If the warning
Passing `scale` via `joint_attention_kwargs` when not using the PEFT backend is ineffective.
appears:
$ pip install peft
cagliostrolab/animagine-xl-3.1
The model used by sketch2lineart and similar apps.
When loading it with DiffusionPipeline.from_pretrained, metadata is fetched from https://huggingface.co/api/models/cagliostrolab/animagine-xl-3.1 and the checkpoint downloads begin. The master copies of these files are at https://huggingface.co/cagliostrolab/animagine-xl-3.1/tree/main.
The large pieces are the unet, used as the noise-prediction model, and the CLIP text encoder, at roughly 5.14 GB and 1.39 GB respectively.
$ cd ~/.cache/huggingface/hub/models--cagliostrolab--animagine-xl-3.1/blobs
$ ls -l
-rw-rw-r-- 1 user user 1389382176 Jun 30 17:30 b2f6541a64af35bb59231e19c8efa840bdb076554ee17d4b418c4197ff07fc87
-rw-rw-r-- 1 user user 5135149760 Jun 30 17:32 c1e43f5fa892e1c54c99fc7caebf9c3426910ea5a730861ff89dead23b9f260e
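The blob file names appear to be SHA-256 digests of the file contents (my assumption; this should hold for LFS-tracked files). A quick sketch to check:
import hashlib
from pathlib import Path

blob = Path("b2f6541a64af35bb59231e19c8efa840bdb076554ee17d4b418c4197ff07fc87")
h = hashlib.sha256()
with blob.open("rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
print(h.hexdigest() == blob.name)  # expect True if the name is the content hash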
sketch2lineart
The article AIでラフを線画に整えるだけの無料webアプリ『sketch2lineart』公開 (announcing a free web app that uses AI just to tidy rough sketches into line art) caught my technical interest, so I ran it locally.
https://huggingface.co/spaces/tori29umai/sketch2lineart
Installation
It seems nothing meshes properly unless PyTorch is 2.2(?), so:
$ pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
$ pip install -r requirement.txt
Launch
$ python app.py
A temporary gradio URL is issued, apparently usable for 72 hours.
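That temporary URL presumably comes from Gradio's share flag, i.e. something like the following inside app.py (an assumption; I did not check the source):
demo.launch(share=True)  # issues a temporary *.gradio.live URL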
Size
$ cd sketch2lineart
$ du -h
2.4G ./controlnet
362M ./tagger
82M ./lora
24K ./utils
...
2.8G .
$ cd ~/.cache/huggingface/hub
$ du -h models--cagliostrolab--animagine-xl-3.1
6.4G models--cagliostrolab--animagine-xl-3.1
$ du -h models--madebyollin--sdxl-vae-fp16-fix
320M models--madebyollin--sdxl-vae-fp16-fix
First conversion
The required models are downloaded and placed under ~/.cache.
config.json: 100%|█████████████████████████████████████████████████████████████████████| 631/631 [00:00<00:00, 2.66MB/s]
diffusion_pytorch_model.safetensors: 100%|███████████████████████████████████████████| 335M/335M [00:03<00:00, 92.7MB/s]
model_index.json: 100%|████████████████████████████████████████████████████████████████| 680/680 [00:00<00:00, 3.98MB/s]
text_encoder/config.json: 100%|████████████████████████████████████████████████████████| 560/560 [00:00<00:00, 6.09MB/s]
text_encoder_2/config.json: 100%|██████████████████████████████████████████████████████| 570/570 [00:00<00:00, 4.46MB/s]
tokenizer/tokenizer_config.json: 100%|█████████████████████████████████████████████████| 705/705 [00:00<00:00, 6.81MB/s]
scheduler/scheduler_config.json: 100%|█████████████████████████████████████████████████| 464/464 [00:00<00:00, 4.31MB/s]
tokenizer/special_tokens_map.json: 100%|███████████████████████████████████████████████| 588/588 [00:00<00:00, 5.20MB/s]
tokenizer_2/tokenizer_config.json: 100%|███████████████████████████████████████████████| 856/856 [00:00<00:00, 5.86MB/s]
tokenizer_2/special_tokens_map.json: 100%|█████████████████████████████████████████████| 462/462 [00:00<00:00, 3.76MB/s]
tokenizer/merges.txt: 100%|██████████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 1.11MB/s]
unet/config.json: 100%|████████████████████████████████████████████████████████████| 1.77k/1.77k [00:00<00:00, 11.3MB/s]
tokenizer/vocab.json: 100%|████████████████████████████████████████████████████████| 1.06M/1.06M [00:00<00:00, 1.81MB/s]
model.safetensors: 100%|█████████████████████████████████████████████████████████████| 246M/246M [00:06<00:00, 35.7MB/s]
diffusion_pytorch_model.safetensors: 100%|██████████████████████████████████████████| 5.14G/5.14G [00:19<00:00, 264MB/s]
model.safetensors: 100%|███████████████████████████████████████████████████████████| 1.39G/1.39G [01:00<00:00, 22.9MB/s]
Fetching 16 files: 100%|████████████████████████████████████████████████████████████████| 16/16 [01:00<00:00, 3.81s/it]
Conversion
Loading pipeline components...: 100%|█████████████████████████████████████████████████████| 7/7 [00:01<00:00, 5.70it/s]
masterpiece, best quality, monochrome, greyscale, lineart, white background
100%|███████████████████████████████████████████████████████████████████████████████████| 30/30 [00:45<00:00, 1.51s/it]
Time taken: 53.10483503341675
Conversion samples
This is impressive...

ControlNet and LoRA both appear to be in use; the LoRA may be the part responsible for the line-art generation.
gsdf/Counterfeit-V2.5
vae
Whether with the convert_vae_pt_to_diffusers.py implementation or with AutoencoderKL.from_single_file, a runtime exception is raised unless pytorch-lightning is installed. For now, the following made it work...
$ pip list | grep pytorch-lightning
pytorch-lightning 2.3.1
$ pip install -qU pytorch-lightning
$ curl -LO https://huggingface.co/gsdf/Counterfeit-V2.5/resolve/main/Counterfeit-V2.5.vae.pt
$ curl -LO https://raw.githubusercontent.com/huggingface/diffusers/main/scripts/convert_vae_pt_to_diffusers.py
$ python convert_vae_pt_to_diffusers.py --vae_pt_path Counterfeit-V2.5.vae.pt --dump_path counterfeit_vae
Rather than doing this conversion, it may be better to download these directly:
$ curl -LO https://huggingface.co/gsdf/Counterfeit-V2.5/resolve/main/vae/config.json
$ curl -LO https://huggingface.co/gsdf/Counterfeit-V2.5/resolve/main/vae/diffusion_pytorch_model.safetensors
import os
from pathlib import Path

import torch
from diffusers import (
StableDiffusionPipeline,
AutoencoderKL,
)
and then:
hub_dir = Path(os.getenv("HOME"))/".cache/huggingface/hub"
vae_st = str(hub_dir/"models--gsdf--Counterfeit-V2.5-vae/diffusion_pytorch_model.safetensors")
# .to_empty() moves a module by allocating fresh, uninitialized storage and so
# discards the loaded weights; .to("cuda") keeps them
vae = AutoencoderKL.from_single_file(vae_st, torch_dtype=torch.float16).to("cuda")
A VAE can be built like this, but when fed to StableDiffusionPipeline it produced all-black images, which was puzzling. Note that .to_empty(device="cuda") allocates uninitialized parameters without copying the loaded weights, which alone would explain black output. Beyond that, it may simply not mesh with the U-Net on the model side, or the pipeline may already pick up and apply a VAE from the blobs automatically; in fact, inspecting pipe.vae shows an AutoencoderKL inside.
pipe = StableDiffusionPipeline.from_pretrained(
"gsdf/Counterfeit-V2.5",
#vae=vae,
torch_dtype=torch.float16
).to("cuda")
pipe.load_textual_inversion(
"gsdf/Counterfeit-V3.0",
weight_name="embedding/EasyNegativeV2.safetensors",
token="EasyNegativeV2"
)
result = pipe(prompt="girl eating pizza", negative_prompt="EasyNegativeV2", width=512, height=512, num_inference_steps=200)
display(result.images[0])
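As an alternative to from_single_file, loading the repo's vae/ subfolder directly with from_pretrained might sidestep the mismatch. A sketch, untested here:
vae = AutoencoderKL.from_pretrained(
    "gsdf/Counterfeit-V2.5", subfolder="vae", torch_dtype=torch.float16
)
pipe = StableDiffusionPipeline.from_pretrained(
    "gsdf/Counterfeit-V2.5", vae=vae, torch_dtype=torch.float16
).to("cuda")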
gsdf/Counterfeit-V3.0
No model_index.json is provided, so StableDiffusionPipeline.from_pretrained cannot be used. Instead, download the safetensors and load it locally. Is a VAE unnecessary? Since I never figured out the VAE hookup for V2.5 either, ignore it for now.
$ mkdir ~/.cache/huggingface/hub/models--gsdf--Counterfeit-V3.0
$ cd ~/.cache/huggingface/hub/models--gsdf--Counterfeit-V3.0
$ curl -LO https://huggingface.co/gsdf/Counterfeit-V3.0/resolve/main/Counterfeit-V3.0_fix_fp16.safetensors
Presumably it can be used like this:
from __future__ import annotations
import os
from pathlib import Path
import torch
from diffusers import StableDiffusionPipeline
hub_dir = Path(os.getenv("HOME"))/".cache/huggingface/hub"
model = str(hub_dir/"models--gsdf--Counterfeit-V3.0/Counterfeit-V3.0_fix_fp16.safetensors")
pipe = StableDiffusionPipeline.from_single_file(
model,
variant="fp16",
torch_dtype=torch.float16
).to("cuda")
pipe.load_textual_inversion(
"gsdf/Counterfeit-V3.0",
weight_name="embedding/EasyNegativeV2.safetensors",
token="EasyNegativeV2"
)
result = pipe(
prompt="girl eating pizza",
negative_prompt="EasyNegativeV2",
width=512,
height=512,
num_inference_steps=200
)
display(result.images[0])
Intuition-driven Langevin dynamics
Langevin dynamics (Wikipedia)
- a method for the mathematical modeling of the dynamics of molecular systems
Langevin equation (Wikipedia)
m \frac{d^2 x}{dt^2} = -\gamma \frac{dx}{dt} + \eta(t)
The first term on the right-hand side is the viscous force, the second the random force.
I am not sure if it is the more general notation, but 非平衡統計力学 (a nonequilibrium statistical mechanics textbook) writes an equation along the lines of
m \dot{v} = -\gamma v - \partial_x U(x) + \xi(t).
For cases like a colloidal particle in water, the above becomes approximately
\gamma \dot{x} = -\partial_x U(x) + \xi(t),
the so-called "overdamped Langevin equation"; written as an SDE, (B.16) lines up well with the SDE in the arXiv paper just below.
arXiv:2011.13456 Score-Based Generative Modeling through Stochastic Differential Equations
Forward process (diffusion process)
The SDE notation of the overdamped Langevin equation above appears; in the paper's notation,
d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\, dt + g(t)\, d\mathbf{w}
where \mathbf{w} is the standard Wiener process, \mathbf{f}(\cdot, t) the drift coefficient, and g(t) the diffusion coefficient.
Reverse process (reverse diffusion)
d\mathbf{x} = \left[ \mathbf{f}(\mathbf{x}, t) - g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \right] dt + g(t)\, d\bar{\mathbf{w}}
where \bar{\mathbf{w}} is a Wiener process with time flowing backwards and \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) is the score of the marginal distribution p_t.
DPM / DDPM
This similarity of form appears consistent with the DPM paper, arXiv:1503.03585 Deep Unsupervised Learning using Nonequilibrium Thermodynamics:
"The Kolmogorov forward and backward equations (Feller, 1949) show that for many forward diffusion processes, the reverse diffusion processes can be described using the same functional form."
Sec. 2 states:
"Our goal is to define a forward (or inference) diffusion process which converts any complex data distribution into a simple, tractable, distribution, and then learn a finite-time reversal of this diffusion process which defines our generative model distribution (See Figure 1)."
In the DDPM paper, Denoising Diffusion Probabilistic Models, the diffusion rates \beta_t define the forward kernel
q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I),
and for a sufficiently large number of steps T, x_T approaches an isotropic Gaussian.
Update rule
The forward-process equation can be read as the update rule
x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I).
True or not, according to ChatGPT, "formally…"
The reverse-process equation can likewise be read as an update rule: predict the injected noise, then step from x_t back toward x_{t-1}.
Score function
c.f. Estimation of Non-Normalized Statistical Models by Score Matching (The Journal of Machine Learning Research, Volume 6, pp. 695-709)
Consider a model given by an unnormalized distribution,
p(x; \theta) = \frac{q(x; \theta)}{Z(\theta)},
where Z(\theta) is the (intractable) normalization constant. Computing the score,
\nabla_x \log p(x; \theta) = \nabla_x \log q(x; \theta),
shows that Z(\theta) drops out, since it does not depend on x.
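Because the score needs only the unnormalized density, Langevin dynamics can sample from p using the score alone. A toy sketch of my own (1-D standard normal, whose score is -x):
import torch

# x_{k+1} = x_k + (eps / 2) * score(x_k) + sqrt(eps) * z_k,  z_k ~ N(0, I)
def langevin_sample(score, x0, eps=1e-2, n_steps=1000):
    x = x0.clone()
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + 0.5 * eps * score(x) + eps ** 0.5 * z
    return x

# target: standard normal, so score(x) = d/dx log p(x) = -x
samples = langevin_sample(lambda x: -x, torch.zeros(10000))
print(samples.mean().item(), samples.std().item())  # roughly 0 and 1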
Kolmogorov backward and forward equations
"The forward and backward equations of the discontinuous processes correspond to the so-called Fokker-Planck equation, well known to physicists, which served Kolmogoroff as the model when he laid the foundations of the general theory."
Here u(\tau, \xi; t, x) denotes the transition probability density from state \xi at time \tau to state x at time t.
Forward equation
\frac{\partial u}{\partial t} = -\frac{\partial}{\partial x}\big[ b(t, x)\, u \big] + \frac{1}{2} \frac{\partial^2}{\partial x^2}\big[ a(t, x)\, u \big]
Backward equation
-\frac{\partial u}{\partial \tau} = b(\tau, \xi) \frac{\partial u}{\partial \xi} + \frac{1}{2} a(\tau, \xi) \frac{\partial^2 u}{\partial \xi^2}
(coefficient conventions here follow Feller's notation as best I can reconstruct them). The best-known solution is apparently the case of uniform diffusion (a constant, b = 0), where u is the Gaussian heat kernel.
The (time-homogeneous) Kolmogorov forward equation and the (time-homogeneous) Kolmogorov backward equation are the same, with coefficients a(x), b(x) independent of time.
* A kind of "nonlinear diffusion equation"?
arXiv:cond-mat/9707325 Equilibrium free energy differences from nonequilibrium measurements: a master equation approach
Looking at C. Jarzynski's arXiv:cond-mat/9707325 Equilibrium free energy differences from nonequilibrium measurements: a master equation approach, in B. Langevin evolution the passage
"Next, let us consider modifying Hamilton's equations by adding both a friction force and a stochastic force:"
introduces the modified equation of motion. This seems to be what the DPM paper refers to as the "Jarzynski equation".
Training
Follow the tutorial Train a diffusion model.
$ pip install open-clip-torch
$ pip install -U "diffusers[training]"
Successfully installed absl-py-2.1.0 accelerate-0.32.1 markdown-3.6 protobuf-3.20.3 tensorboard-2.17.0 tensorboard-data-server-0.7.2 werkzeug-3.0.3
About this much should do?
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
protoc-gen-openapiv2 0.0.1 requires protobuf>=4.21.0, but you have protobuf 3.20.3 which is incompatible.
appears, so additionally:
$ pip install -U "protobuf<5"
Even training exactly as in the tutorial consumes a fair amount of GPU memory:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01 Driver Version: 515.86.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A4000 Off | 00000000:51:00.0 Off | Off |
| 67% 87C P2 139W / 140W | 10746MiB / 16376MiB | 97% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
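The heart of that tutorial's loop is a noise-prediction MSE step, roughly as below (a condensed sketch of the tutorial, not a verbatim copy; hyperparameters are illustrative):
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, UNet2DModel

model = UNet2DModel(sample_size=128, in_channels=3, out_channels=3).cuda()
noise_scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(clean_images):  # (B, 3, 128, 128) tensor on the GPU
    noise = torch.randn_like(clean_images)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (clean_images.shape[0],), device=clean_images.device,
    )
    # forward process: mix noise into the clean images at random timesteps
    noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)
    # the UNet learns to predict the injected noise
    noise_pred = model(noisy_images, timesteps, return_dict=False)[0]
    loss = F.mse_loss(noise_pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()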
city96/FLUX.1-dev-gguf
I want to try a quantized FLUX.1.
$ mkdir ~/.cache/huggingface/hub/models--city96--FLUX.1-dev-gguf
$ cd ~/.cache/huggingface/hub/models--city96--FLUX.1-dev-gguf
$ curl -LO https://huggingface.co/city96/FLUX.1-dev-gguf/resolve/main/flux1-dev-Q8_0.gguf
The following fails: the file is not for torch. Apparently it can be loaded with stable-diffusion.cpp.
hub_dir = Path(os.getenv("HOME"))/".cache/huggingface/hub"
model = str(hub_dir/"models--city96--FLUX.1-dev-gguf/flux1-dev-Q8_0.gguf")
pipeline = StableDiffusionPipeline.from_single_file(
model,
#text_encoder_3=None,
#tokenizer_3=None,
torch_dtype=torch.float16,
).to("cuda")
So, first check for OpenBLAS:
$ ldconfig -p | grep openblas
If it is missing:
$ sudo apt-get install libopenblas-dev
Then:
$ sudo apt install ccache
$ git clone --recursive https://github.com/leejet/stable-diffusion.cpp
$ cd stable-diffusion.cpp
$ mkdir build
$ cd build
$ cmake .. -DGGML_OPENBLAS=ON
$ cmake --build . --config Release
Alternatively, check the following, and if cuBLAS looks usable:
$ cat /usr/local/cuda/include/cublas_api.h | grep CUBLAS_VER
build like this:
$ sudo apt install ccache
$ git clone --recursive https://github.com/leejet/stable-diffusion.cpp
$ cd stable-diffusion.cpp
$ mkdir build
$ cd build
$ cmake .. -DSD_CUBLAS=ON
$ cmake --build . --config Release
I think the build took over an hour.
$ cmake --build . --config Release
[ 1%] Building C object thirdparty/CMakeFiles/zip.dir/zip.c.o
In file included from /home/xxx/git_work/stable-diffusion.cpp/thirdparty/zip.c:40:
/home/xxx/git_work/stable-diffusion.cpp/thirdparty/miniz.h:4988:9: note: ‘#pragma message: Using fopen, ftello, fseeko, stat() etc. path for file I/O - this path may not support large files.’
4988 | #pragma message( \
| ^~~~~~~
[ 1%] Built target zip
[ 2%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml.c.o
[ 3%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-alloc.c.o
[ 5%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-backend.c.o
[ 6%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-quants.c.o
[ 7%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/acc.cu.o
[ 8%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/arange.cu.o
[ 10%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/argsort.cu.o
[ 11%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/binbcast.cu.o
[ 12%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/clamp.cu.o
[ 14%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/concat.cu.o
[ 15%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/conv-transpose-1d.cu.o
[ 16%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/convert.cu.o
[ 17%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/cpy.cu.o
[ 19%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/cross-entropy-loss.cu.o
[ 20%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/diagmask.cu.o
[ 21%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/dmmv.cu.o
[ 23%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/fattn-tile-f16.cu.o
[ 24%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/fattn-tile-f32.cu.o
[ 25%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/fattn.cu.o
[ 26%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/getrows.cu.o
[ 28%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/im2col.cu.o
[ 29%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/mmq.cu.o
[ 30%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/mmvq.cu.o
[ 32%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/norm.cu.o
[ 33%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/pad.cu.o
[ 34%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/pool2d.cu.o
[ 35%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/quantize.cu.o
[ 37%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/rope.cu.o
[ 38%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/scale.cu.o
[ 39%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/softmax.cu.o
[ 41%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/sumrows.cu.o
[ 42%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/tsembd.cu.o
[ 43%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/unary.cu.o
[ 44%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/upscale.cu.o
[ 46%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda.cu.o
/home/xxx/git_work/stable-diffusion.cpp/ggml/src/ggml-cuda.cu(2436): warning #177-D: function "set_ggml_graph_node_properties" was declared but never referenced
static void set_ggml_graph_node_properties(ggml_tensor * node, ggml_graph_node_properties * graph_node_properties) {
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/home/xxx/git_work/stable-diffusion.cpp/ggml/src/ggml-cuda.cu(2448): warning #177-D: function "ggml_graph_node_has_matching_properties" was declared but never referenced
static bool ggml_graph_node_has_matching_properties(ggml_tensor * node, ggml_graph_node_properties * graph_node_properties) {
^
...
[ 47%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb16.cu.o
[ 48%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb32.cu.o
[ 50%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb16.cu.o
[ 51%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb32.cu.o
[ 52%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb8.cu.o
[ 53%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq1_s.cu.o
[ 55%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq2_s.cu.o
[ 56%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq2_xs.cu.o
[ 57%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq2_xxs.cu.o
[ 58%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq3_s.cu.o
[ 60%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq3_xxs.cu.o
[ 61%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq4_nl.cu.o
[ 62%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq4_xs.cu.o
[ 64%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q2_k.cu.o
[ 65%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q3_k.cu.o
[ 66%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q4_0.cu.o
[ 67%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q4_1.cu.o
[ 69%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q4_k.cu.o
[ 70%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q5_0.cu.o
[ 71%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q5_1.cu.o
[ 73%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q5_k.cu.o
[ 74%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q6_k.cu.o
[ 75%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q8_0.cu.o
[ 76%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q4_0-q4_0.cu.o
[ 78%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q4_0-q4_0.cu.o
[ 79%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q8_0-q8_0.cu.o
[ 80%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q8_0-q8_0.cu.o
[ 82%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-f16-f16.cu.o
[ 83%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs256-f16-f16.cu.o
[ 84%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs64-f16-f16.cu.o
[ 85%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-f16-f16.cu.o
[ 87%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs256-f16-f16.cu.o
[ 88%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs64-f16-f16.cu.o
[ 89%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-aarch64.c.o
[ 91%] Linking CUDA static library libggml.a
[ 91%] Built target ggml
[ 92%] Building CXX object CMakeFiles/stable-diffusion.dir/model.cpp.o
[ 93%] Building CXX object CMakeFiles/stable-diffusion.dir/stable-diffusion.cpp.o
[ 94%] Building CXX object CMakeFiles/stable-diffusion.dir/upscaler.cpp.o
[ 96%] Building CXX object CMakeFiles/stable-diffusion.dir/util.cpp.o
[ 97%] Linking CXX static library libstable-diffusion.a
[ 97%] Built target stable-diffusion
[ 98%] Building CXX object examples/cli/CMakeFiles/sd.dir/main.cpp.o
[100%] Linking CXX executable ../../bin/sd
[100%] Built target sd
Download the various model files following the article 「FLUX.1」ローカル環境の使い方!Stable Diffusion WebUI ForgeやComifyUIで画像生成.
However, use city96/FLUX.1-dev-gguf for the model body itself.
$ mkdir models
$ cd models
$ curl -LO https://huggingface.co/city96/FLUX.1-dev-gguf/resolve/main/flux1-dev-Q8_0.gguf
$ curl -LO https://huggingface.co/black-forest-labs/FLUX.1-schnell/resolve/main/ae.safetensors
$ curl -LO https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/clip_l.safetensors
$ curl -LO https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp16.safetensors
$ du -h
22G .
city96/FLUX.1-dev-gguf (continued)
Generated following stable-diffusion.cpp で CPU で FLUX.1 Schnell を動かす (an article on running FLUX.1 Schnell on CPU with stable-diffusion.cpp):
$ ./build/bin/sd --diffusion-model models/flux1-dev-Q8_0.gguf --vae ./models/ae.safetensors --clip_l ./models/clip_l.safetensors --t5xxl ./models/t5xxl_fp16.safetensors -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v
Option:
n_threads: 4
mode: txt2img
model_path:
wtype: unspecified
clip_l_path: ./models/clip_l.safetensors
t5xxl_path: ./models/t5xxl_fp16.safetensors
diffusion_model_path: models/flux1-dev-Q8_0.gguf
vae_path: ./models/ae.safetensors
taesd_path:
esrgan_path:
controlnet_path:
embeddings_path:
stacked_id_embeddings_path:
input_id_images_path:
style ratio: 20.00
normalize input image : false
output_path: output.png
init_img:
control_image:
clip on cpu: false
controlnet cpu: false
vae decoder on cpu:false
strength(control): 0.90
prompt: a lovely cat holding a sign says 'flux.cpp'
negative_prompt:
min_cfg: 1.00
cfg_scale: 1.00
guidance: 3.50
clip_skip: -1
width: 512
height: 512
sample_method: euler
schedule: default
sample_steps: 20
strength(img2img): 0.75
rng: cuda
seed: 42
batch_count: 1
vae_tiling: false
upscale_repeats: 1
System Info:
BLAS = 1
SSE3 = 1
AVX = 1
AVX2 = 1
AVX512 = 1
AVX512_VBMI = 0
AVX512_VNNI = 1
FMA = 1
NEON = 0
ARM_FMA = 0
F16C = 1
FP16_VA = 0
WASM_SIMD = 0
VSX = 0
[DEBUG] stable-diffusion.cpp:157 - Using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA L4, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:202 - loading clip_l from './models/clip_l.safetensors'
[INFO ] model.cpp:793 - load ./models/clip_l.safetensors using safetensors format
[DEBUG] model.cpp:861 - init from './models/clip_l.safetensors'
[INFO ] stable-diffusion.cpp:209 - loading t5xxl from './models/t5xxl_fp16.safetensors'
[INFO ] model.cpp:793 - load ./models/t5xxl_fp16.safetensors using safetensors format
[DEBUG] model.cpp:861 - init from './models/t5xxl_fp16.safetensors'
[INFO ] stable-diffusion.cpp:216 - loading diffusion model from 'models/flux1-dev-Q8_0.gguf'
[INFO ] model.cpp:790 - load models/flux1-dev-Q8_0.gguf using gguf format
[DEBUG] model.cpp:807 - init from 'models/flux1-dev-Q8_0.gguf'
[INFO ] stable-diffusion.cpp:223 - loading vae from './models/ae.safetensors'
[INFO ] model.cpp:793 - load ./models/ae.safetensors using safetensors format
[DEBUG] model.cpp:861 - init from './models/ae.safetensors'
[INFO ] stable-diffusion.cpp:235 - Version: Flux Dev
[INFO ] stable-diffusion.cpp:266 - Weight type: f16
[INFO ] stable-diffusion.cpp:267 - Conditioner weight type: f16
[INFO ] stable-diffusion.cpp:268 - Diffusion model weight type: q8_0
[INFO ] stable-diffusion.cpp:269 - VAE weight type: f32
[DEBUG] stable-diffusion.cpp:271 - ggml tensor size = 400 bytes
[INFO ] stable-diffusion.cpp:310 - set clip_on_cpu to true
[INFO ] stable-diffusion.cpp:313 - CLIP: Using CPU backend
[DEBUG] clip.hpp:171 - vocab size: 49408
[DEBUG] clip.hpp:182 - trigger word img already in vocab
[DEBUG] ggml_extend.hpp:1046 - clip params backend buffer size = 235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:1046 - t5 params backend buffer size = 9083.77 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:1046 - flux params backend buffer size = 12068.09 MB(VRAM) (780 tensors)
[DEBUG] ggml_extend.hpp:1046 - vae params backend buffer size = 94.57 MB(VRAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:398 - loading weights
[DEBUG] model.cpp:1530 - loading tensors from ./models/clip_l.safetensors
[DEBUG] model.cpp:1530 - loading tensors from ./models/t5xxl_fp16.safetensors
[INFO ] model.cpp:1685 - unknown tensor 'text_encoders.t5xxl.encoder.embed_tokens.weight | f16 | 2 [4096, 32128, 1, 1, 1]' in model file
[DEBUG] model.cpp:1530 - loading tensors from models/flux1-dev-Q8_0.gguf
[DEBUG] model.cpp:1530 - loading tensors from ./models/ae.safetensors
[INFO ] stable-diffusion.cpp:482 - total params memory size = 21481.50MB (VRAM 12162.66MB, RAM 9318.83MB): clip 9318.83MB(RAM), unet 12068.09MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:501 - loading model from '' completed, taking 95.26s
[INFO ] stable-diffusion.cpp:518 - running in Flux FLOW mode
[DEBUG] stable-diffusion.cpp:572 - finished loaded file
[DEBUG] stable-diffusion.cpp:1378 - txt2img 512x512
[DEBUG] stable-diffusion.cpp:1127 - prompt after extract and remove lora: "a lovely cat holding a sign says 'flux.cpp'"
[INFO ] stable-diffusion.cpp:655 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1132 - apply_loras completed, taking 0.00s
[DEBUG] conditioner.hpp:1036 - parse 'a lovely cat holding a sign says 'flux.cpp'' to [['a lovely cat holding a sign says 'flux.cpp'', 1], ]
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] t5.hpp:397 - token length: 256
[DEBUG] ggml_extend.hpp:998 - t5 compute buffer size: 68.25 MB(RAM)
[DEBUG] conditioner.hpp:1155 - computing condition graph completed, taking 35379 ms
[INFO ] stable-diffusion.cpp:1256 - get_learned_condition completed, taking 35384 ms
[INFO ] stable-diffusion.cpp:1279 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1283 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:998 - flux compute buffer size: 398.50 MB(VRAM)
|==================================================| 20/20 - 1.10s/it
[INFO ] stable-diffusion.cpp:1315 - sampling completed, taking 42.70s
[INFO ] stable-diffusion.cpp:1323 - generating 1 latent images completed, taking 43.16s
[INFO ] stable-diffusion.cpp:1326 - decoding 1 latents
[DEBUG] ggml_extend.hpp:998 - vae compute buffer size: 1664.00 MB(VRAM)
[DEBUG] stable-diffusion.cpp:987 - computing vae [mode: DECODE] graph completed, taking 1.10s
[INFO ] stable-diffusion.cpp:1336 - latent 1 decoded, taking 1.10s
[INFO ] stable-diffusion.cpp:1340 - decode_first_stage completed, taking 1.10s
[INFO ] stable-diffusion.cpp:1449 - txt2img completed in 79.65s
save result image to 'output.png'
VRAM usage is moderate, so something like a T4 should also manage. RAM-wise, the sd command used about 60% of the 16 GB system memory.
$ nvidia-smi
Mon Sep 9 23:33:04 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:00:04.0 Off | 0 |
| N/A 49C P0 30W / 72W | 12359MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
./build/bin/sd --diffusion-model models/flux1-dev-Q8_0.gguf --vae ./models/ae.safetensors --clip_l ./models/clip_l.safetensors --t5xxl ./models/t5xxl_fp16.safetensors -p "A muscular macho male warrior is holding a large sword. The sword is covered with flames." --cfg-scale 1.0 --sampling-method euler -v
A muscular macho male warrior wearing golden armor is holding a large sword. The sword is covered with flames. He is now fighting with a dragon.
With the following, "Anime image. Comic image." was not reflected at all:
Anime image. Comic image. A muscular macho male warrior wearing golden armor is holding a large sword. The sword is covered with flames. He is now fighting with a dragon.
Switched to "anime-style" and also used "against":
An anime-style muscular macho male warrior wearing golden armor is holding a large sword. The sword is covered with flames. He is now fighting against a dragon.
A winged goddess flies in the sky and watches over the earth.
FLUX.1
$ pip install bitsandbytes
Successfully installed bitsandbytes-0.43.3
Following sayakpaul/flux.1-dev-nf4-with-bnb-integration, see [Quantization] Add quantization support for bitsandbytes #9213.
$ pip install -U git+https://github.com/huggingface/diffusers@c795c82df39620e2576ccda765b6e67e849c36e7
Successfully installed diffusers-0.31.0.dev0
With this,
from diffusers import BitsAndBytesConfig
becomes possible.
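For example, an NF4 config can be passed when loading the transformer (a sketch based on my reading of that PR; the argument set may differ across revisions):
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)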
Value error: This tokenizer cannot be instantiated. Please make sure you have sentencepiece installed in order to use this tokenizer.
came up in some cases, so per Value error : sentencepiece it seems wise to also run:
$ pip install sentencepiece
At this point, note the remaining local storage:
$ df -T
Filesystem Type 1K-blocks Used Available Use% Mounted on
shm tmpfs 65536 0 65536 0% /dev/shm
tmpfs tmpfs 32783196 0 32783196 0% /sys/fs/cgroup
/dev/sda2 ext4 479079112 49726888 404942832 11% /dev/termination-log
overlay overlay 103019448 47990824 55012240 47% /
Continuing with FLUX.1:
%%time
import torch
from diffusers import FluxTransformer2DModel

nf4_id = "sayakpaul/flux.1-dev-nf4-with-bnb-integration"
model_nf4 = FluxTransformer2DModel.from_pretrained(nf4_id, torch_dtype=torch.bfloat16)
print(model_nf4.dtype)
print(model_nf4.config.quantization_config)
diffusion_pytorch_model.safetensors: 10% [.........] 682M/6.70G [00:21<02:49, 35.5MB/s]
torch.uint8
BitsAndBytesConfig {
"_load_in_4bit": true,
"_load_in_8bit": false,
"bnb_4bit_compute_dtype": "bfloat16",
"bnb_4bit_quant_storage": "uint8",
"bnb_4bit_quant_type": "nf4",
"bnb_4bit_use_double_quant": false,
"llm_int8_enable_fp32_cpu_offload": false,
"llm_int8_has_fp16_weight": false,
"llm_int8_skip_modules": null,
"llm_int8_threshold": 6.0,
"load_in_4bit": true,
"load_in_8bit": false,
"quant_method": "bitsandbytes"
}
CPU times: user 9.6 s, sys: 12.7 s, total: 22.3 s
Wall time: 1min 52s
$ df -T
...
overlay overlay 103019448 54533208 48469856 53% /
The following gets rejected with Access to model black-forest-labs/FLUX.1-dev is restricted. Following How do I get permission?#84, request access at https://huggingface.co/black-forest-labs/FLUX.1-dev, then run:
from huggingface_hub import login
login(token="your_access_token")
%%time
from diffusers import FluxPipeline

model_id = "black-forest-labs/FLUX.1-dev"
pipe = FluxPipeline.from_pretrained(model_id, transformer=model_nf4, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
model_index.json: 100% [........] 536/536 [00:00<00:00, 79.1kB/s]
Fetching 18 files: 100% [........] 18/18 [01:57<00:00, 29.47s/it]
(…)t_encoder_2/model.safetensors.index.json: 100% [........] 19.9k/19.9k [00:00<00:00, 1.14MB/s]
scheduler/scheduler_config.json: 100% [........] 273/273 [00:00<00:00, 39.4kB/s]
tokenizer/merges.txt: 100% [........] 525k/525k [00:00<00:00, 1.11MB/s]
model.safetensors: 100% [........] 246M/246M [00:08<00:00, 30.1MB/s]
model-00002-of-00002.safetensors: 100% [........] 4.53G/4.53G [01:48<00:00, 31.9MB/s]
model-00001-of-00002.safetensors: 100% [........] 4.99G/4.99G [01:56<00:00, 68.4MB/s]
text_encoder/config.json: 100% [........] 613/613 [00:00<00:00, 65.2kB/s]
text_encoder_2/config.json: 100% [........] 782/782 [00:00<00:00, 91.7kB/s]
tokenizer/tokenizer_config.json: 100% [........] 705/705 [00:00<00:00, 64.0kB/s]
tokenizer/special_tokens_map.json: 100% [........] 588/588 [00:00<00:00, 60.1kB/s]
tokenizer_2/tokenizer.json: 100% [........] 2.42M/2.42M [00:00<00:00, 2.73MB/s]
spiece.model: 100% [........] 792k/792k [00:00<00:00, 7.57MB/s]
tokenizer/vocab.json: 100% [........] 1.06M/1.06M [00:00<00:00, 1.64MB/s]
tokenizer_2/special_tokens_map.json: 100% [........] 2.54k/2.54k [00:00<00:00, 183kB/s]
tokenizer_2/tokenizer_config.json: 100% [........] 20.8k/20.8k [00:00<00:00, 221kB/s]
vae/config.json: 100% [........] 820/820 [00:00<00:00, 62.8kB/s]
diffusion_pytorch_model.safetensors: 100% [........] 168M/168M [00:07<00:00, 22.8MB/s]
Loading pipeline components...: 100% [........] 7/7 [00:03<00:00, 2.18it/s]
Loading checkpoint shards: 100% [........] 2/2 [00:00<00:00, 8.60it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
The module 'FluxTransformer2DModel' has been loaded in `bitsandbytes` 4bit and moving it to cpu via `.to()` is not supported. Module is still on cuda:0. In most cases, it is recommended to not change the device.
CPU times: user 12.8 s, sys: 15.2 s, total: 28 s
Wall time: 2min 2s
$ df -T
...
overlay overlay 103019448 64243628 38759436 63% /
$ cd ~/.cache/huggingface/hub/models--black-forest-labs--FLUX.1-dev
$ du -h
4.0K ./snapshots/0ef5fff789c832c5c7f4e127f94c8b54bbcced44/tokenizer
8.0K ./snapshots/0ef5fff789c832c5c7f4e127f94c8b54bbcced44/vae
4.0K ./snapshots/0ef5fff789c832c5c7f4e127f94c8b54bbcced44/scheduler
12K ./snapshots/0ef5fff789c832c5c7f4e127f94c8b54bbcced44/text_encoder_2
8.0K ./snapshots/0ef5fff789c832c5c7f4e127f94c8b54bbcced44/text_encoder
8.0K ./snapshots/0ef5fff789c832c5c7f4e127f94c8b54bbcced44/tokenizer_2
48K ./snapshots/0ef5fff789c832c5c7f4e127f94c8b54bbcced44
52K ./snapshots
8.0K ./refs
9.3G ./blobs
9.3G .
Then:
prompt = "A mystic cat with a sign that says hello world!"
image = pipe(prompt, guidance_scale=3.5, num_inference_steps=50, generator=torch.manual_seed(0)).images[0]
image.save("flux-nf4-dev-loaded.png")
And then...
OutOfMemoryError Traceback (most recent call last)
FLUX.1 (continued)
The approach above seems a bit too much for 16 GB of VRAM, so switch to a different method.
Some of the previous downloads are no longer needed, so clean up first:
$ cd ~/.cache/huggingface/hub
$ rm -rf models--sayakpaul--flux.1-dev-nf4-with-bnb-integration
Use the following as reference. In sayakpaul/flux.1-dev-nf4-with-bnb-integration, under "To run", it says:
Check out sayakpaul/flux.1-dev-nf4-pkg that shows how to run this checkpoint along with an NF4 T5 in a free-tier Colab Notebook.
Jumping to that link, the description reads:
Running Flux.1-dev under 12GBs
This repository contains the NF4 params for the T5 and transformer of Flux.1-Dev. Check out this Colab Notebook for details on how they were obtained. Check out this notebook that shows how to use the checkpoints and run in a free-tier Colab Notebook.
Follow that.
from transformers import T5EncoderModel
from diffusers import FluxTransformer2DModel, FluxPipeline
import torch
import gc
def flush():
"""Wipes off memory."""
gc.collect()
torch.cuda.empty_cache()
torch.cuda.reset_max_memory_allocated()
torch.cuda.reset_peak_memory_stats()
def bytes_to_giga_bytes(bytes):
return f"{(bytes / 1024 / 1024 / 1024):.3f}"
flush()
to clean up the VRAM first, then:
%%time
nf4_model_id = "sayakpaul/flux.1-dev-nf4-pkg"
text_encoder_2 = T5EncoderModel.from_pretrained(
nf4_model_id, subfolder="text_encoder_2", torch_dtype=torch.float16
)
transformer = FluxTransformer2DModel.from_pretrained(
nf4_model_id, subfolder="transformer", torch_dtype=torch.float16
)
text_encoder_2/config.json: 100% [........] 1.27k/1.27k [00:00<00:00, 273kB/s]
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
model.safetensors: 100% [........] 6.33G/6.33G [06:16<00:00, 16.3MB/s]
transformer/config.json: 100% [........] 902/902 [00:00<00:00, 83.4kB/s]
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'diffusers.quantizers.quantization_config.BitsAndBytesConfig'>.
diffusion_pytorch_model.safetensors: 100% [........] 6.70G/6.70G [01:58<00:00, 38.7MB/s]
CPU times: user 26.1 s, sys: 30.6 s, total: 56.7 s
Wall time: 8min 20s
Just in case, check the cache size:
$ cd ~/.cache/huggingface/hub/models--sayakpaul--flux.1-dev-nf4-pkg
$ du -h
8.0K ./snapshots/da15611221570dbf0ad4f620165f0072e443e3c4/transformer
8.0K ./snapshots/da15611221570dbf0ad4f620165f0072e443e3c4/text_encoder_2
20K ./snapshots/da15611221570dbf0ad4f620165f0072e443e3c4
24K ./snapshots
8.0K ./refs
4.0K ./.no_exist/da15611221570dbf0ad4f620165f0072e443e3c4/transformer
8.0K ./.no_exist/da15611221570dbf0ad4f620165f0072e443e3c4
12K ./.no_exist
13G ./blobs
13G .
Continue with the following. black-forest-labs/FLUX.1-dev has already been downloaded, so this finishes quickly.
%%time
pipe = FluxPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
text_encoder_2=text_encoder_2,
transformer=transformer,
torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()
And now, finally, generation.
%%time
prompt = "A mystic cat with a sign that says hello world!"
image = pipe(
prompt,
guidance_scale=3.5,
num_inference_steps=50,
generator=torch.manual_seed(0)
).images[0]
torch.cuda.empty_cache()
memory = bytes_to_giga_bytes(torch.cuda.memory_allocated())
print(f"{memory=} GB.")
image.save("flux-nf4-dev.png")
VRAM is pretty much at the limit...
$ nvidia-smi
...
| 41% 61C P2 138W / 140W | 15098MiB / 16376MiB | 100% Default |
...
And in the end:
OutOfMemoryError: CUDA out of memory.
Wondering whether things are somehow different on Colab, I nevertheless followed the error message's suggestion:
$ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
$ jupyter lab
and then it worked. Still tight as ever, but there seems to be slightly more headroom. For GPUs with exactly 16 GB of VRAM like the T4 / A4000, launching this way looks like the move; with around 24 GB like the L4, either way seems fine.
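The same setting can presumably be applied from inside the notebook, as long as it runs before the first CUDA allocation:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # must precede CUDA init
import torch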
$ nvidia-smi
...
| 59% 80C P2 139W / 140W | 14566MiB / 16376MiB | 100% Default |
100% [........] 50/50 [01:38<00:00, 2.01s/it]
The module 'T5EncoderModel' has been loaded in `bitsandbytes` 4bit and moving it to cpu via `.to()` is not supported. Module is still on cuda:0. In most cases, it is recommended to not change the device.
The module 'FluxTransformer2DModel' has been loaded in `bitsandbytes` 4bit and moving it to cpu via `.to()` is not supported. Module is still on cuda:0. In most cases, it is recommended to not change the device.
memory='12.141' GB.
CPU times: user 1min, sys: 39.9 s, total: 1min 40s
Wall time: 1min 40s
Reasonably fast...? Roughly 1min 40s on an A4000 and about 1min 57s on an L4.
In the end, storage ended up as follows:
$ df -T
...
overlay overlay 103019448 70426584 32576480 69% /