
Diffusers / Stable Diffusion web UI / Diffusion Models


Stable Diffusion web UI setup

$ sudo apt install --no-install-recommends google-perftools
$ sudo apt install bc
$ git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
$ cd stable-diffusion-webui
$ bash ./webui.sh

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye), Fedora 34+ and openSUSE Leap 15.4 or newer.
################################################################
...
Downloading: "https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.safetensors" to /home/xxx/yyy/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors

 85%|██████████████████████████████████████████████████████████████████            | 3.36G/3.97G [00:53<00:15, 42.0MB/s]

Launch

Open localhost:7860 in a browser.

Data size

$ du -h
...
4.0G    ./models/Stable-diffusion
...
4.3G    .

Diffusers

Model files

These live under ~/.cache/huggingface/hub/.

$ ls -1 ~/.cache/huggingface/hub/
models--openai--clip-vit-large-patch14
models--runwayml--stable-diffusion-v1-5
models--stabilityai--stable-diffusion-xl-base-1.0
version.txt
version_diffusers_cache.txt
For example, this cache gets populated when loading a pipeline such as:

import torch
from diffusers import AutoPipelineForText2Image

pipeline = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
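
Generation is then the usual pipeline call (a minimal sketch; the prompt and file name are arbitrary examples):

image = pipeline("an astronaut riding a horse on the moon").images[0]
image.save("astronaut.png")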

Add Stable Diffusion 3

$ pip install 'huggingface_hub[cli,torch]'
$ huggingface-cli login

After that, download with the following; the model is about 15 GB. c.f. stabilityai/stable-diffusion-3-medium-diffusers

model_id = "stabilityai/stable-diffusion-3-medium-diffusers"

pipeline = DiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
)
pipeline.to("cuda")

data did not match any variant of untagged enum PyPreTokenizerTypeWrapper

appears, update transformers (>4.37.2). c.f. SD3 - data did not match any variant of untagged enum PyPreTokenizerTypeWrapper, data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 960 column 3

OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB. GPU

How much VRAM?

model_id = "stabilityai/stable-diffusion-3-medium-diffusers"

pipeline = DiffusionPipeline.from_pretrained(
    model_id,
    text_encoder_3=None,
    tokenizer_3=None,
    torch_dtype=torch.float16,
)
pipeline.to("cuda")

then it fits:

6618MiB / 16376MiB
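
Generation is then the usual pipeline call (a sketch; prompt, steps, and guidance are arbitrary examples):

image = pipeline(
    "a cat holding a sign that says hello world",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]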

If the following message appears, install peft:

Passing scale via joint_attention_kwargs when not using the PEFT backend is ineffective.

$ pip install peft

c.f. how can we fix the message "Passing scale via joint_attention_kwargs when not using the PEFT backend is ineffective."


cagliostrolab/animagine-xl-3.1

The model used by sketch2lineart and similar tools.

When loading it with DiffusionPipeline.from_pretrained, the metadata is fetched from https://huggingface.co/api/models/cagliostrolab/animagine-xl-3.1 and the download of the checkpoints and other files begins. The master copies of these files are at https://huggingface.co/cagliostrolab/animagine-xl-3.1/tree/main.

The large files are mainly the unet, which serves as the noise-prediction model, and the CLIP text encoder, at roughly 5.14 GB and 1.39 GB respectively.

$ cd ~/.cache/huggingface/hub/models--cagliostrolab--animagine-xl-3.1/blobs
$ ls -l

-rw-rw-r-- 1 user user 1389382176 Jun 30 17:30 b2f6541a64af35bb59231e19c8efa840bdb076554ee17d4b418c4197ff07fc87
-rw-rw-r-- 1 user user 5135149760 Jun 30 17:32 c1e43f5fa892e1c54c99fc7caebf9c3426910ea5a730861ff89dead23b9f260e
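
For reference, loading it should look like loading any SDXL checkpoint on the Hub (a minimal sketch, not verified against this exact repo):

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "cagliostrolab/animagine-xl-3.1",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")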


sketch2lineart

The article AIでラフを線画に整えるだけの無料webアプリ『sketch2lineart』公開 (announcing a free web app that just cleans rough sketches into line art) was technically intriguing, so I ran it locally.

https://huggingface.co/spaces/tori29umai/sketch2lineart

Installation

It seems things only mesh properly with PyTorch 2.2, so:

$ pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
$ pip install -r requirements.txt

Launch

$ python app.py

A temporary gradio URL is issued, apparently usable for just 72 hours.
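
This is presumably gradio's share-link feature; in general such temporary links are created like this (a generic sketch, not the app's actual code):

import gradio as gr

demo = gr.Interface(fn=lambda text: text, inputs="text", outputs="text")
demo.launch(share=True)  # issues a temporary *.gradio.live URL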

Sizes

$ cd sketch2lineart
$ du -h
2.4G    ./controlnet
362M    ./tagger
82M     ./lora
24K     ./utils
...
2.8G    .
$ cd ~/.cache/huggingface/hub
$ du -h models--cagliostrolab--animagine-xl-3.1
6.4G    models--cagliostrolab--animagine-xl-3.1
$ du -h models--madebyollin--sdxl-vae-fp16-fix
320M    models--madebyollin--sdxl-vae-fp16-fix

On first conversion

The required models are downloaded and placed under ~/.cache.

config.json: 100%|█████████████████████████████████████████████████████████████████████| 631/631 [00:00<00:00, 2.66MB/s]
diffusion_pytorch_model.safetensors: 100%|███████████████████████████████████████████| 335M/335M [00:03<00:00, 92.7MB/s]
model_index.json: 100%|████████████████████████████████████████████████████████████████| 680/680 [00:00<00:00, 3.98MB/s]
text_encoder/config.json: 100%|████████████████████████████████████████████████████████| 560/560 [00:00<00:00, 6.09MB/s]
text_encoder_2/config.json: 100%|██████████████████████████████████████████████████████| 570/570 [00:00<00:00, 4.46MB/s]
tokenizer/tokenizer_config.json: 100%|█████████████████████████████████████████████████| 705/705 [00:00<00:00, 6.81MB/s]
scheduler/scheduler_config.json: 100%|█████████████████████████████████████████████████| 464/464 [00:00<00:00, 4.31MB/s]
tokenizer/special_tokens_map.json: 100%|███████████████████████████████████████████████| 588/588 [00:00<00:00, 5.20MB/s]
tokenizer_2/tokenizer_config.json: 100%|███████████████████████████████████████████████| 856/856 [00:00<00:00, 5.86MB/s]
tokenizer_2/special_tokens_map.json: 100%|█████████████████████████████████████████████| 462/462 [00:00<00:00, 3.76MB/s]
tokenizer/merges.txt: 100%|██████████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 1.11MB/s]
unet/config.json: 100%|████████████████████████████████████████████████████████████| 1.77k/1.77k [00:00<00:00, 11.3MB/s]
tokenizer/vocab.json: 100%|████████████████████████████████████████████████████████| 1.06M/1.06M [00:00<00:00, 1.81MB/s]
model.safetensors: 100%|█████████████████████████████████████████████████████████████| 246M/246M [00:06<00:00, 35.7MB/s]
diffusion_pytorch_model.safetensors: 100%|██████████████████████████████████████████| 5.14G/5.14G [00:19<00:00, 264MB/s]
model.safetensors: 100%|███████████████████████████████████████████████████████████| 1.39G/1.39G [01:00<00:00, 22.9MB/s]
Fetching 16 files: 100%|████████████████████████████████████████████████████████████████| 16/16 [01:00<00:00,  3.81s/it]

Conversion

Loading pipeline components...: 100%|█████████████████████████████████████████████████████| 7/7 [00:01<00:00,  5.70it/s]
masterpiece, best quality, monochrome, greyscale, lineart, white background
100%|███████████████████████████████████████████████████████████████████████████████████| 30/30 [00:45<00:00,  1.51s/it]
Time taken: 53.10483503341675

Conversion samples

This is impressive...

It appears to use ControlNet and LoRA; the LoRA side may be what handles the line-art generation.


gsdf/Counterfeit-V2.5

vae

Both the convert_vae_pt_to_diffusers.py implementation and AutoencoderKL.from_single_file raise a runtime exception unless pytorch-lightning is installed. For now, the following got things working...

$ pip install -qU pytorch-lightning
$ pip list | grep pytorch-lightning
pytorch-lightning         2.3.1

$ curl -LO https://huggingface.co/gsdf/Counterfeit-V2.5/resolve/main/Counterfeit-V2.5.vae.pt
$ curl -LO https://raw.githubusercontent.com/huggingface/diffusers/main/scripts/convert_vae_pt_to_diffusers.py
$ python convert_vae_pt_to_diffusers.py --vae_pt_path Counterfeit-V2.5.vae.pt --dump_path counterfeit_vae

Instead of doing that conversion, it may be better to just download these directly:

$ curl -LO https://huggingface.co/gsdf/Counterfeit-V2.5/resolve/main/vae/config.json
$ curl -LO https://huggingface.co/gsdf/Counterfeit-V2.5/resolve/main/vae/diffusion_pytorch_model.safetensors

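If those two files are placed in one directory, they can presumably then be loaded with the ordinary from_pretrained (a sketch, assuming they were saved under the hypothetical path ./counterfeit_vae_hf):

import torch
from diffusers import AutoencoderKL

# assumes ./counterfeit_vae_hf contains config.json and diffusion_pytorch_model.safetensors
vae = AutoencoderKL.from_pretrained(
    "./counterfeit_vae_hf", torch_dtype=torch.float16
).to("cuda")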

import os
from pathlib import Path

import torch
from diffusers import (
    StableDiffusionPipeline,
    AutoencoderKL,
)

and then:

hub_dir = Path(os.getenv("HOME"))/".cache/huggingface/hub"
vae_st = str(hub_dir/"models--gsdf--Counterfeit-V2.5-vae/diffusion_pytorch_model.safetensors")
vae = AutoencoderKL.from_single_file(vae_st, torch_dtype=torch.float16).to_empty(device="cuda")

Something like the above builds a VAE, but when passed to StableDiffusionPipeline it produced pitch-black images, so something is off. Is it failing to mesh with the U-Net on the model side? Or the pipeline may simply be picking the VAE up from the blobs and applying it automatically; indeed, inspecting pipe.vae shows an AutoencoderKL inside. Note also that to_empty() moves a module to the device without copying its weights, leaving them uninitialized, which by itself would explain black images; .to("cuda") is probably what was wanted.

pipe = StableDiffusionPipeline.from_pretrained(
    "gsdf/Counterfeit-V2.5",
    #vae=vae,
    torch_dtype=torch.float16
).to("cuda")

pipe.load_textual_inversion(
    "gsdf/Counterfeit-V3.0",
    weight_name="embedding/EasyNegativeV2.safetensors", 
    token="EasyNegativeV2"
)

result = pipe(prompt="girl eating pizza", negative_prompt="EasyNegativeV2", width=512, height=512, num_inference_steps=200)

display(result.images[0])

gsdf/Counterfeit-V3.0

model_index.json is not provided, so StableDiffusionPipeline.from_pretrained cannot be used. Instead, download the safetensors and load it locally. Is the VAE unnecessary? Since the V2.5 VAE hookup never worked out either, it is ignored for now.

$ mkdir ~/.cache/huggingface/hub/models--gsdf--Counterfeit-V3.0
$ cd ~/.cache/huggingface/hub/models--gsdf--Counterfeit-V3.0
$ curl -LO https://huggingface.co/gsdf/Counterfeit-V3.0/resolve/main/Counterfeit-V3.0_fix_fp16.safetensors

The following usage probably works:

from __future__ import annotations

import os
from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline


hub_dir = Path(os.getenv("HOME"))/".cache/huggingface/hub"
model = str(hub_dir/"models--gsdf--Counterfeit-V3.0/Counterfeit-V3.0_fix_fp16.safetensors")

pipe = StableDiffusionPipeline.from_single_file(
    model,
    text_encoder_3=None,  # leftover from the SD3 experiments; probably unnecessary for an SD1.x checkpoint
    tokenizer_3=None,
    variant="fp16",
    torch_dtype=torch.float16
).to("cuda")

pipe.load_textual_inversion(
    "gsdf/Counterfeit-V3.0",
    weight_name="embedding/EasyNegativeV2.safetensors", 
    token="EasyNegativeV2"
)

result = pipe(
    prompt="girl eating pizza",
    negative_prompt="EasyNegativeV2",
    width=512,
    height=512,
    num_inference_steps=200
)

display(result.images[0])

Intuition-driven Langevin dynamics

Langevin dynamics (Wikipedia)

  • A method for the mathematical modeling of the dynamics of molecular systems

Langevin equation (Wikipedia)

\begin{align*} m \frac{d \bm{v}}{dt} = - \beta \bm{v} + \eta (t) \end{align*}

The first term on the right-hand side is the viscous force; the second is the random force.

I am not sure whether this is the more standard notation, but the textbook 非平衡統計力学 (Nonequilibrium Statistical Mechanics) gives

\begin{align*} m \frac{d \bm{v}}{dt} = - \gamma \bm{v} + F(\bm{x} (t), t) + g \xi (t) \tag{B.2} \end{align*}

For cases such as colloidal particles in water, the above approximately becomes

\begin{align*} \gamma \frac{d \bm{x}}{dt} = F(\bm{x} (t), t) + g \xi (t) \tag{B.6} \end{align*}

which is called the “overdamped Langevin equation”; its SDE form (B.16) agrees well with the SDE in the arXiv paper just below.

arXiv:2011.13456 Score-Based Generative Modeling through Stochastic Differential Equations

Forward process (diffusion process)

The paper writes the SDE form of the above overdamped Langevin equation:

\begin{align*} d \bm{x} = \bm{f} (\bm{x}, t) dt + g(t) d \bm{w} \tag{5} \end{align*}

Here \bm{w} is a Wiener process, \bm{f} (\cdot, t) is the drift coefficient, and g (\cdot) is the diffusion coefficient.

Reverse process (reverse diffusion process)

\begin{align*} d \bm{x} = [\bm{f} (\bm{x}, t) - g (t)^2 \nabla_{\bm{x}} \log p_t (\bm{x})] dt + g(t) d \tilde{\bm{w}} \tag{6} \end{align*}

Here \tilde{\bm{w}} is a Wiener process.

DPM / DDPM

This similarity of form seems to accord with the DPM paper, arXiv:1503.03585 Deep Unsupervised Learning using Nonequilibrium Thermodynamics:

Kolmogorov's forward and backward equations (Feller, 1949) show that for many forward diffusion processes, the reverse diffusion process can be described with the same functional form.

Sec. 2 states:

Our goal is to define a forward (or inference) diffusion process which converts any complex data distribution into a simple, tractable distribution, and then learn a finite-time reversal of this diffusion process which defines our generative model distribution (see Figure 1).

The DDPM paper shows that, by imposing conditions on the diffusion rates \beta_t, the “simple, tractable distribution” becomes a Gaussian. That is, arXiv:2006.11239 Denoising Diffusion Probabilistic Models gives

\begin{align*} q (\bm{x}_t | \bm{x}_0) = \mathcal{N} (\bm{x}_t; \sqrt{\bar{\alpha}_t} \bm{x}_0, (1 - \bar{\alpha}_t) \bold{I}) \tag{4} \end{align*}

so that, taking \bar{\alpha}_t \simeq 0 for sufficiently large t, we get \bm{x}_t \sim q (\bm{x}_t | \bm{x}_0) \approx \mathcal{N} (\bm{x}_t; \bold{0}, \bold{I}).
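
As a quick numerical check of Eq. (4) (a sketch assuming the linear \beta_t schedule from the DDPM paper):

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)  # DDPM's linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(3, 64, 64)  # stand-in for a data point
t = T - 1  # large t, so alpha_bar_t ~ 0
noise = torch.randn_like(x0)
xt = alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * noise
print(float(alphas_bar[t]))  # ~ 4e-5: x_t is essentially pure noise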

Update rule

The forward-process equation seems to admit an interpretation as the update rule

\begin{align*} \bm{x}_{t+1} = \bm{x}_{t} + \bm{f} (\bm{x}_{t}, t) + g(t) d \bm{w} \end{align*}

Whether this is true is unclear, but according to ChatGPT, “formally, d \bm{w} (t) is sometimes treated as an infinitesimal change following a normal distribution, in the form d \bm{w} (t) \sim \mathcal{N} (0, dt).” If this can be justified, then discretizing with dt \equiv 1 gives d \bm{w} (t) \sim \mathcal{N} (0, 1), and the intuition of adding Gaussian noise at each step emerges.

For the reverse-process equation as well, \nabla_{\bm{x}} \log p_t (\bm{x}) is the score function, so if we allow ourselves to regard it as agreeing, up to a constant factor, with a noise estimator \epsilon_\theta (\bm{x}_t, t), the intuition emerges that the whole thing could be implemented with a U-Net; see the sketch below.
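
Written as code, the intuition of the last two paragraphs is just the following (a toy sketch with dt \equiv 1; f, g, and score are arbitrary stand-ins, not the actual trained components):

import torch

def forward_step(x, f, g, t):
    # x_{t+1} = x_t + f(x_t, t) + g(t) * N(0, I): the discretized forward SDE (5)
    return x + f(x, t) + g(t) * torch.randn_like(x)

def reverse_step(x, f, g, score, t):
    # discretization of the reverse SDE (6), stepping from t to t-1
    return x - (f(x, t) - g(t) ** 2 * score(x, t)) + g(t) * torch.randn_like(x)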

Score function

c.f. Estimation of Non-Normalized Statistical Models by Score Matching (The Journal of Machine Learning Research, Volume 6, Pages 695–709)

Consider a model in which the probability distribution (density function) is obtained by normalizing an unnormalized distribution \phi (\bm{x}):

\begin{align*} p (\bm{x}) = \frac{\phi (\bm{x})}{Z} \end{align*}

When computing Z is practically impossible, evaluating

\begin{align*} \nabla_{\bm{x}} \log p (\bm{x}) &= \nabla_{\bm{x}} (\log \phi (\bm{x}) - \log Z) \\ &= \nabla_{\bm{x}} \log \phi (\bm{x}) \end{align*}

makes Z drop out, which is convenient. This quantity is called the “score.”
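
For example, for an unnormalized Gaussian \phi (\bm{x}) = e^{- \| \bm{x} \|^2 / 2}, whatever Z is:

\begin{align*} \nabla_{\bm{x}} \log p (\bm{x}) = \nabla_{\bm{x}} \left( - \frac{\| \bm{x} \|^2}{2} - \log Z \right) = - \bm{x} \end{align*}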

Kolmogorov backward and forward equations

The forward and backward equations of discontinuous processes correspond to the so-called Fokker-Planck equation for u (\tau, \xi; t, x), familiar to physicists, which served Kolmogoroff as a model when he laid the foundations of the general theory.

Forward equation

\begin{align*} \begin{split} u_t (\tau, \xi; t, x) &= \frac{1}{2} [a (t, x) u (\tau, \xi; t, x)]_{xx} + [b (t, x) u (\tau, \xi; t, x)]_x \\ &= \frac{1}{2} \frac{\partial^2}{\partial x^2} [a (t, x) u (\tau, \xi; t, x)] + \frac{\partial}{\partial x} [b (t, x) u (\tau, \xi; t, x)] \end{split} \tag{76} \end{align*}

Backward equation

\begin{align*} \begin{split} u_{\tau} (\tau, \xi; t, x) &= \frac{1}{2} a (\tau, x) u_{\xi \xi} (\tau, \xi; t, x) + b (\tau, x) u_{\xi} (\tau, \xi; t, x) \\ &= \frac{1}{2} a (\tau, x) \frac{\partial^2 u}{\partial \xi^2} (\tau, \xi; t, x) + b (\tau, x) \frac{\partial u}{\partial \xi} (\tau, \xi; t, x) \end{split} \tag{77} \end{align*}

The best-known solutions of these are for the homogeneous-diffusion case b (t, x) = 0, a (t, x) = 2, where

\begin{align*} u (\tau, \xi; t, x) = \frac{1}{2 \{ \pi (t - \tau) \}^{\frac{1}{2}}} \exp \left\{ - \frac{(x - \xi)^2}{4 (t - \tau)} \right\} \tag{79} \end{align*}

apparently. Putting A^* = \frac{1}{2} \frac{\partial^2}{\partial x^2} a (t, x) + \frac{\partial}{\partial x} b (t, x) and A = \frac{1}{2} a (\tau, x) \frac{\partial^2}{\partial \xi^2} + b (\tau, x) \frac{\partial}{\partial \xi}, the operators A^* and A are formally adjoint to each other, so in modern notation this would presumably be written as follows...

Kolmogorov forward equation (time-homogeneous)

\begin{align*} \frac{\partial u}{\partial t} = A^* u (t, x), \quad t > 0, x \in \R^d \tag{5.24} \end{align*}

Kolmogorov backward equation (time-homogeneous)

\begin{align*} \frac{\partial u}{\partial t} &= Au + Vu +g, \quad t > 0, x \in \R^d \tag{5.27} \\ u (0, x) &= f(x), \quad x \in \R^d \tag{5.28} \end{align*}

※ A kind of “nonlinear diffusion equation”? With V = 0 and g = 0, taking A elliptic, say A = \partial_i (a_{ij} \partial_j), and in particular A = \Delta, this reduces to the diffusion equation.

arXiv:cond-mat/9707325 Equilibrium free energy differences from nonequilibrium measurements: a master equation approach

Looking at Sec. B. Langevin evolution of this paper by C. Jarzynski, the following appears:

Let us next consider modifying Hamilton's equations by adding both a frictional and a stochastic force:

\begin{align*} \dot{x} &= p \tag{31} \\ \dot{p} &= - \partial_x V_{\lambda} - \gamma p + \tilde{F} (t) \tag{32} \end{align*}

This seems to be what the DPM paper refers to as the “Jarzynski equation.”


Training

The tutorial Train a diffusion model is used as the reference.

$ pip install open-clip-torch
$ pip install -U "diffusers[training]"
Successfully installed absl-py-2.1.0 accelerate-0.32.1 markdown-3.6 protobuf-3.20.3 tensorboard-2.17.0 tensorboard-data-server-0.7.2 werkzeug-3.0.3

That much should be about enough?

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
protoc-gen-openapiv2 0.0.1 requires protobuf>=4.21.0, but you have protobuf 3.20.3 which is incompatible.

comes up, so additionally:

$ pip install -U "protobuf<5"

Even training exactly as in the tutorial eats a fair amount of GPU memory:

$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:51:00.0 Off |                  Off |
| 67%   87C    P2   139W / 140W |  10746MiB / 16376MiB |     97%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

city96/FLUX.1-dev-gguf

I want to use a quantized FLUX.1.

$ mkdir ~/.cache/huggingface/hub/models--city96--FLUX.1-dev-gguf
$ cd ~/.cache/huggingface/hub/models--city96--FLUX.1-dev-gguf
$ curl -LO https://huggingface.co/city96/FLUX.1-dev-gguf/resolve/main/flux1-dev-Q8_0.gguf

The following does not work, since the file is not for torch; it can apparently be loaded with stable-diffusion.cpp instead.

import os
from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline

hub_dir = Path(os.getenv("HOME"))/".cache/huggingface/hub"
model = str(hub_dir/"models--city96--FLUX.1-dev-gguf/flux1-dev-Q8_0.gguf")

pipeline = StableDiffusionPipeline.from_single_file(
    model,
    #text_encoder_3=None,
    #tokenizer_3=None,
    torch_dtype=torch.float16,
).to("cuda")

So, check that OpenBLAS is present:

$ ldconfig -p | grep openblas

and if it is missing:

$ sudo apt-get install libopenblas-dev

Then:

$ sudo apt install ccache
$ git clone --recursive https://github.com/leejet/stable-diffusion.cpp
$ cd stable-diffusion.cpp
$ mkdir build
$ cd build
$ cmake .. -DGGML_OPENBLAS=ON
$ cmake --build . --config Release

Alternatively, check with the following, and if cuBLAS looks usable,

$ cat /usr/local/cuda/include/cublas_api.h | grep CUBLAS_VER

build like this:

$ sudo apt install ccache
$ git clone --recursive https://github.com/leejet/stable-diffusion.cpp
$ cd stable-diffusion.cpp
$ mkdir build
$ cd build
$ cmake .. -DSD_CUBLAS=ON
$ cmake --build . --config Release

I think the build took over an hour.

$ cmake --build . --config Release
[  1%] Building C object thirdparty/CMakeFiles/zip.dir/zip.c.o
In file included from /home/xxx/git_work/stable-diffusion.cpp/thirdparty/zip.c:40:
/home/xxx/git_work/stable-diffusion.cpp/thirdparty/miniz.h:4988:9: note: ‘#pragma message: Using fopen, ftello, fseeko, stat() etc. path for file I/O - this path may not support large files.’
 4988 | #pragma message(                                                               \
      |         ^~~~~~~
[  1%] Built target zip
[  2%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml.c.o
[  3%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-alloc.c.o
[  5%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-backend.c.o
[  6%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-quants.c.o
[  7%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/acc.cu.o
[  8%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/arange.cu.o
[ 10%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/argsort.cu.o
[ 11%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/binbcast.cu.o
[ 12%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/clamp.cu.o
[ 14%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/concat.cu.o
[ 15%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/conv-transpose-1d.cu.o
[ 16%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/convert.cu.o
[ 17%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/cpy.cu.o
[ 19%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/cross-entropy-loss.cu.o
[ 20%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/diagmask.cu.o
[ 21%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/dmmv.cu.o
[ 23%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/fattn-tile-f16.cu.o
[ 24%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/fattn-tile-f32.cu.o
[ 25%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/fattn.cu.o
[ 26%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/getrows.cu.o
[ 28%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/im2col.cu.o
[ 29%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/mmq.cu.o
[ 30%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/mmvq.cu.o
[ 32%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/norm.cu.o
[ 33%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/pad.cu.o
[ 34%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/pool2d.cu.o
[ 35%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/quantize.cu.o
[ 37%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/rope.cu.o
[ 38%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/scale.cu.o
[ 39%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/softmax.cu.o
[ 41%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/sumrows.cu.o
[ 42%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/tsembd.cu.o
[ 43%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/unary.cu.o
[ 44%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/upscale.cu.o
[ 46%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda.cu.o
/home/xxx/git_work/stable-diffusion.cpp/ggml/src/ggml-cuda.cu(2436): warning #177-D: function "set_ggml_graph_node_properties" was declared but never referenced
  static void set_ggml_graph_node_properties(ggml_tensor * node, ggml_graph_node_properties * graph_node_properties) {
              ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/home/xxx/git_work/stable-diffusion.cpp/ggml/src/ggml-cuda.cu(2448): warning #177-D: function "ggml_graph_node_has_matching_properties" was declared but never referenced
  static bool ggml_graph_node_has_matching_properties(ggml_tensor * node, ggml_graph_node_properties * graph_node_properties) {
              ^

(the same two warnings and remark repeat for several more compilation passes)

[ 47%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb16.cu.o
[ 48%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb32.cu.o
[ 50%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb16.cu.o
[ 51%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb32.cu.o
[ 52%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb8.cu.o
[ 53%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq1_s.cu.o
[ 55%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq2_s.cu.o
[ 56%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq2_xs.cu.o
[ 57%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq2_xxs.cu.o
[ 58%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq3_s.cu.o
[ 60%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq3_xxs.cu.o
[ 61%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq4_nl.cu.o
[ 62%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq4_xs.cu.o
[ 64%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q2_k.cu.o
[ 65%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q3_k.cu.o
[ 66%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q4_0.cu.o
[ 67%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q4_1.cu.o
[ 69%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q4_k.cu.o
[ 70%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q5_0.cu.o
[ 71%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q5_1.cu.o
[ 73%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q5_k.cu.o
[ 74%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q6_k.cu.o
[ 75%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q8_0.cu.o
[ 76%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q4_0-q4_0.cu.o
[ 78%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q4_0-q4_0.cu.o
[ 79%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q8_0-q8_0.cu.o
[ 80%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q8_0-q8_0.cu.o
[ 82%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-f16-f16.cu.o
[ 83%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs256-f16-f16.cu.o
[ 84%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs64-f16-f16.cu.o
[ 85%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-f16-f16.cu.o
[ 87%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs256-f16-f16.cu.o
[ 88%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs64-f16-f16.cu.o
[ 89%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-aarch64.c.o
[ 91%] Linking CUDA static library libggml.a
[ 91%] Built target ggml
[ 92%] Building CXX object CMakeFiles/stable-diffusion.dir/model.cpp.o
[ 93%] Building CXX object CMakeFiles/stable-diffusion.dir/stable-diffusion.cpp.o
[ 94%] Building CXX object CMakeFiles/stable-diffusion.dir/upscaler.cpp.o
[ 96%] Building CXX object CMakeFiles/stable-diffusion.dir/util.cpp.o
[ 97%] Linking CXX static library libstable-diffusion.a
[ 97%] Built target stable-diffusion
[ 98%] Building CXX object examples/cli/CMakeFiles/sd.dir/main.cpp.o
[100%] Linking CXX executable ../../bin/sd
[100%] Built target sd

Downloaded the models following the article 「FLUX.1」ローカル環境の使い方!Stable Diffusion WebUI ForgeやComifyUIで画像生成.

However, city96/FLUX.1-dev-gguf is used for the model itself.

$ mkdir models
$ cd models
$ curl -LO https://huggingface.co/city96/FLUX.1-dev-gguf/resolve/main/flux1-dev-Q8_0.gguf
$ curl -LO https://huggingface.co/black-forest-labs/FLUX.1-schnell/resolve/main/ae.safetensors
$ curl -LO https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/clip_l.safetensors
$ curl -LO https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp16.safetensors
$ du -h
22G     .

city96/FLUX.1-dev-gguf (continued)

Generated following the article stable-diffusion.cpp で CPU で FLUX.1 Schnell を動かす:

$ ./build/bin/sd --diffusion-model  models/flux1-dev-Q8_0.gguf --vae ./models/ae.safetensors --clip_l ./models/clip_l.safetensors --t5xxl ./models/t5xxl_fp16.safetensors  -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v
Option:
    n_threads:         4
    mode:              txt2img
    model_path:
    wtype:             unspecified
    clip_l_path:       ./models/clip_l.safetensors
    t5xxl_path:        ./models/t5xxl_fp16.safetensors
    diffusion_model_path:   models/flux1-dev-Q8_0.gguf
    vae_path:          ./models/ae.safetensors
    taesd_path:
    esrgan_path:
    controlnet_path:
    embeddings_path:
    stacked_id_embeddings_path:
    input_id_images_path:
    style ratio:       20.00
    normalize input image :  false
    output_path:       output.png
    init_img:
    control_image:
    clip on cpu:       false
    controlnet cpu:    false
    vae decoder on cpu:false
    strength(control): 0.90
    prompt:            a lovely cat holding a sign says 'flux.cpp'
    negative_prompt:
    min_cfg:           1.00
    cfg_scale:         1.00
    guidance:          3.50
    clip_skip:         -1
    width:             512
    height:            512
    sample_method:     euler
    schedule:          default
    sample_steps:      20
    strength(img2img): 0.75
    rng:               cuda
    seed:              42
    batch_count:       1
    vae_tiling:        false
    upscale_repeats:   1
System Info:
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 1
    AVX512_VBMI = 0
    AVX512_VNNI = 1
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:157  - Using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA L4, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:202  - loading clip_l from './models/clip_l.safetensors'
[INFO ] model.cpp:793  - load ./models/clip_l.safetensors using safetensors format
[DEBUG] model.cpp:861  - init from './models/clip_l.safetensors'
[INFO ] stable-diffusion.cpp:209  - loading t5xxl from './models/t5xxl_fp16.safetensors'
[INFO ] model.cpp:793  - load ./models/t5xxl_fp16.safetensors using safetensors format
[DEBUG] model.cpp:861  - init from './models/t5xxl_fp16.safetensors'
[INFO ] stable-diffusion.cpp:216  - loading diffusion model from 'models/flux1-dev-Q8_0.gguf'
[INFO ] model.cpp:790  - load models/flux1-dev-Q8_0.gguf using gguf format
[DEBUG] model.cpp:807  - init from 'models/flux1-dev-Q8_0.gguf'
[INFO ] stable-diffusion.cpp:223  - loading vae from './models/ae.safetensors'
[INFO ] model.cpp:793  - load ./models/ae.safetensors using safetensors format
[DEBUG] model.cpp:861  - init from './models/ae.safetensors'
[INFO ] stable-diffusion.cpp:235  - Version: Flux Dev
[INFO ] stable-diffusion.cpp:266  - Weight type:                 f16
[INFO ] stable-diffusion.cpp:267  - Conditioner weight type:     f16
[INFO ] stable-diffusion.cpp:268  - Diffusion model weight type: q8_0
[INFO ] stable-diffusion.cpp:269  - VAE weight type:             f32
[DEBUG] stable-diffusion.cpp:271  - ggml tensor size = 400 bytes
[INFO ] stable-diffusion.cpp:310  - set clip_on_cpu to true
[INFO ] stable-diffusion.cpp:313  - CLIP: Using CPU backend
[DEBUG] clip.hpp:171  - vocab size: 49408
[DEBUG] clip.hpp:182  -  trigger word img already in vocab
[DEBUG] ggml_extend.hpp:1046 - clip params backend buffer size =  235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:1046 - t5 params backend buffer size =  9083.77 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:1046 - flux params backend buffer size =  12068.09 MB(VRAM) (780 tensors)
[DEBUG] ggml_extend.hpp:1046 - vae params backend buffer size =  94.57 MB(VRAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:398  - loading weights
[DEBUG] model.cpp:1530 - loading tensors from ./models/clip_l.safetensors
[DEBUG] model.cpp:1530 - loading tensors from ./models/t5xxl_fp16.safetensors
[INFO ] model.cpp:1685 - unknown tensor 'text_encoders.t5xxl.encoder.embed_tokens.weight | f16 | 2 [4096, 32128, 1, 1, 1]' in model file
[DEBUG] model.cpp:1530 - loading tensors from models/flux1-dev-Q8_0.gguf
[DEBUG] model.cpp:1530 - loading tensors from ./models/ae.safetensors
[INFO ] stable-diffusion.cpp:482  - total params memory size = 21481.50MB (VRAM 12162.66MB, RAM 9318.83MB): clip 9318.83MB(RAM), unet 12068.09MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:501  - loading model from '' completed, taking 95.26s
[INFO ] stable-diffusion.cpp:518  - running in Flux FLOW mode
[DEBUG] stable-diffusion.cpp:572  - finished loaded file
[DEBUG] stable-diffusion.cpp:1378 - txt2img 512x512
[DEBUG] stable-diffusion.cpp:1127 - prompt after extract and remove lora: "a lovely cat holding a sign says 'flux.cpp'"
[INFO ] stable-diffusion.cpp:655  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1132 - apply_loras completed, taking 0.00s
[DEBUG] conditioner.hpp:1036 - parse 'a lovely cat holding a sign says 'flux.cpp'' to [['a lovely cat holding a sign says 'flux.cpp'', 1], ]
[DEBUG] clip.hpp:311  - token length: 77
[DEBUG] t5.hpp:397  - token length: 256
[DEBUG] ggml_extend.hpp:998  - t5 compute buffer size: 68.25 MB(RAM)
[DEBUG] conditioner.hpp:1155 - computing condition graph completed, taking 35379 ms
[INFO ] stable-diffusion.cpp:1256 - get_learned_condition completed, taking 35384 ms
[INFO ] stable-diffusion.cpp:1279 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1283 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:998  - flux compute buffer size: 398.50 MB(VRAM)
  |==================================================| 20/20 - 1.10s/it
[INFO ] stable-diffusion.cpp:1315 - sampling completed, taking 42.70s
[INFO ] stable-diffusion.cpp:1323 - generating 1 latent images completed, taking 43.16s
[INFO ] stable-diffusion.cpp:1326 - decoding 1 latents
[DEBUG] ggml_extend.hpp:998  - vae compute buffer size: 1664.00 MB(VRAM)
[DEBUG] stable-diffusion.cpp:987  - computing vae [mode: DECODE] graph completed, taking 1.10s
[INFO ] stable-diffusion.cpp:1336 - latent 1 decoded, taking 1.10s
[INFO ] stable-diffusion.cpp:1340 - decode_first_stage completed, taking 1.10s
[INFO ] stable-diffusion.cpp:1449 - txt2img completed in 79.65s
save result image to 'output.png'

VRAM usage is moderate, so something like a T4 should also work. The sd command's RAM usage is about 60% of the 16 GB system RAM.

$ nvidia-smi
Mon Sep  9 23:33:04 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:00:04.0 Off |                    0 |
| N/A   49C    P0             30W /   72W |   12359MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
./build/bin/sd --diffusion-model  models/flux1-dev-Q8_0.gguf --vae ./models/ae.safetensors --clip_l ./models/clip_l.safetensors --t5xxl ./models/t5xxl_fp16.safetensors  -p "A muscular macho male warrior is holding a large sword. The sword is covered with flames." --cfg-scale 1.0 --sampling-method euler -v
A muscular macho male warrior wearing golden armor is holding a large sword. The sword is covered with flames. He is now fighting with a dragon.

In the following, “Anime image. Comic image.” was not reflected at all.

Anime image. Comic image. A muscular macho male warrior wearing golden armor is holding a large sword. The sword is covered with flames. He is now fighting with a dragon.

So I changed it to “anime-style” and also used “against”.

An anime-style muscular macho male warrior wearing golden armor is holding a large sword. The sword is covered with flames. He is now fighting against a dragon.
A winged goddess flies in the sky and watches over the earth.

FLUX.1

$ pip install bitsandbytes
Successfully installed bitsandbytes-0.43.3

Following sayakpaul/flux.1-dev-nf4-with-bnb-integration, have a look at [Quantization] Add quantization support for bitsandbytes #9213.

$ pip install -U git+https://github.com/huggingface/diffusers@c795c82df39620e2576ccda765b6e67e849c36e7
Successfully installed diffusers-0.31.0.dev0

This makes the following import available:

from diffusers import BitsAndBytesConfig

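Per the PR, the intended usage is roughly to pass a quantization_config when loading the transformer (a sketch based on the PR's description; details may differ in this dev version):

import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)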

Value error: This tokenizer cannot be instantiated. Please make sure you have sentencepiece installed in order to use this tokenizer.

came up in some cases, so, per Value error : sentencepiece, it seems wise to also run:

$ pip install sentencepiece

At this point, note the remaining local storage:

$ df -T
Filesystem     Type    1K-blocks     Used Available Use% Mounted on
shm            tmpfs       65536        0     65536   0% /dev/shm
tmpfs          tmpfs    32783196        0  32783196   0% /sys/fs/cgroup
/dev/sda2      ext4    479079112 49726888 404942832  11% /dev/termination-log
overlay        overlay 103019448 47990824  55012240  47% /

FLUX.1 の続き:

%%time

import torch
from diffusers import FluxTransformer2DModel

nf4_id = "sayakpaul/flux.1-dev-nf4-with-bnb-integration"
model_nf4 = FluxTransformer2DModel.from_pretrained(nf4_id, torch_dtype=torch.bfloat16)
print(model_nf4.dtype)
print(model_nf4.config.quantization_config)

diffusion_pytorch_model.safetensors:  10% [.........]  682M/6.70G [00:21<02:49, 35.5MB/s]
torch.uint8
BitsAndBytesConfig {
"_load_in_4bit": true,
"_load_in_8bit": false,
"bnb_4bit_compute_dtype": "bfloat16",
"bnb_4bit_quant_storage": "uint8",
"bnb_4bit_quant_type": "nf4",
"bnb_4bit_use_double_quant": false,
"llm_int8_enable_fp32_cpu_offload": false,
"llm_int8_has_fp16_weight": false,
"llm_int8_skip_modules": null,
"llm_int8_threshold": 6.0,
"load_in_4bit": true,
"load_in_8bit": false,
"quant_method": "bitsandbytes"
}

CPU times: user 9.6 s, sys: 12.7 s, total: 22.3 s
Wall time: 1min 52s

$ df -T
...
overlay        overlay 103019448 54533208  48469856  53% /

The following is rejected with Access to model black-forest-labs/FLUX.1-dev is restricted. Following How do I get permission?#84, obtain access at https://huggingface.co/black-forest-labs/FLUX.1-dev, then run:

from huggingface_hub import login
login(token="your_access_token")


%%time

from diffusers import FluxPipeline

model_id = "black-forest-labs/FLUX.1-dev"
pipe = FluxPipeline.from_pretrained(model_id, transformer=model_nf4, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

model_index.json: 100% [........] 536/536 [00:00<00:00, 79.1kB/s]
Fetching 18 files: 100% [........] 18/18 [01:57<00:00, 29.47s/it]
(…)t_encoder_2/model.safetensors.index.json: 100% [........] 19.9k/19.9k [00:00<00:00, 1.14MB/s]
scheduler/scheduler_config.json: 100% [........] 273/273 [00:00<00:00, 39.4kB/s]
tokenizer/merges.txt: 100% [........] 525k/525k [00:00<00:00, 1.11MB/s]
model.safetensors: 100% [........] 246M/246M [00:08<00:00, 30.1MB/s]
model-00002-of-00002.safetensors: 100% [........] 4.53G/4.53G [01:48<00:00, 31.9MB/s]
model-00001-of-00002.safetensors: 100% [........] 4.99G/4.99G [01:56<00:00, 68.4MB/s]
text_encoder/config.json: 100% [........] 613/613 [00:00<00:00, 65.2kB/s]
text_encoder_2/config.json: 100% [........] 782/782 [00:00<00:00, 91.7kB/s]
tokenizer/tokenizer_config.json: 100% [........] 705/705 [00:00<00:00, 64.0kB/s]
tokenizer/special_tokens_map.json: 100% [........] 588/588 [00:00<00:00, 60.1kB/s]
tokenizer_2/tokenizer.json: 100% [........] 2.42M/2.42M [00:00<00:00, 2.73MB/s]
spiece.model: 100% [........] 792k/792k [00:00<00:00, 7.57MB/s]
tokenizer/vocab.json: 100% [........] 1.06M/1.06M [00:00<00:00, 1.64MB/s]
tokenizer_2/special_tokens_map.json: 100% [........] 2.54k/2.54k [00:00<00:00, 183kB/s]
tokenizer_2/tokenizer_config.json: 100% [........] 20.8k/20.8k [00:00<00:00, 221kB/s]
vae/config.json: 100% [........] 820/820 [00:00<00:00, 62.8kB/s]
diffusion_pytorch_model.safetensors: 100% [........] 168M/168M [00:07<00:00, 22.8MB/s]
Loading pipeline components...: 100% [........] 7/7 [00:03<00:00,  2.18it/s]
Loading checkpoint shards: 100% [........] 2/2 [00:00<00:00,  8.60it/s]
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
The module 'FluxTransformer2DModel' has been loaded in bitsandbytes 4bit and moving it to cpu via .to() is not supported. Module is still on cuda:0. In most cases, it is recommended to not change the device.
CPU times: user 12.8 s, sys: 15.2 s, total: 28 s
Wall time: 2min 2s

$ df -T
...
overlay        overlay 103019448 64243628  38759436  63% /
$ cd ~/.cache/huggingface/hub/models--black-forest-labs--FLUX.1-dev
$ du -h
4.0K    ./snapshots/0ef5fff789c832c5c7f4e127f94c8b54bbcced44/tokenizer
8.0K    ./snapshots/0ef5fff789c832c5c7f4e127f94c8b54bbcced44/vae
4.0K    ./snapshots/0ef5fff789c832c5c7f4e127f94c8b54bbcced44/scheduler
12K     ./snapshots/0ef5fff789c832c5c7f4e127f94c8b54bbcced44/text_encoder_2
8.0K    ./snapshots/0ef5fff789c832c5c7f4e127f94c8b54bbcced44/text_encoder
8.0K    ./snapshots/0ef5fff789c832c5c7f4e127f94c8b54bbcced44/tokenizer_2
48K     ./snapshots/0ef5fff789c832c5c7f4e127f94c8b54bbcced44
52K     ./snapshots
8.0K    ./refs
9.3G    ./blobs
9.3G    .

prompt = "A mystic cat with a sign that says hello world!"
image = pipe(prompt, guidance_scale=3.5, num_inference_steps=50, generator=torch.manual_seed(0)).images[0]
image.save("flux-nf4-dev-loaded.png")

Then...

OutOfMemoryError Traceback (most recent call last)


FLUX.1 (continued)

With the above, 16 GB of VRAM seems slightly insufficient, so switch to another method.
Part of the previous download is no longer needed, so remove it:

$ cd ~/.cache/huggingface/hub
$ rm -rf models--sayakpaul--flux.1-dev-nf4-with-bnb-integration


The following serves as the reference.

sayakpaul/flux.1-dev-nf4-with-bnb-integration

To run
Check out sayakpaul/flux.1-dev-nf4-pkg that shows how to run this checkpoint along with an NF4 T5 in a free-tier Colab Notebook.

Then jump to the linked repo, which says:

Running Flux.1-dev under 12GBs
This repository contains the NF4 params for the T5 and transformer of Flux.1-Dev. Check out this Colab Notebook for details on how they were obtained.

Check out this notebook that shows how to use the checkpoints and run in a free-tier Colab Notebook.

and follow that.

from transformers import T5EncoderModel
from diffusers import FluxTransformer2DModel, FluxPipeline
import torch
import gc


def flush():
    """Wipes off memory."""
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()


def bytes_to_giga_bytes(bytes):
    return f"{(bytes / 1024 / 1024 / 1024):.3f}"

flush()

Running this clears VRAM; then:

%%time

nf4_model_id = "sayakpaul/flux.1-dev-nf4-pkg"
text_encoder_2 = T5EncoderModel.from_pretrained(
    nf4_model_id, subfolder="text_encoder_2", torch_dtype=torch.float16
)
transformer = FluxTransformer2DModel.from_pretrained(
    nf4_model_id, subfolder="transformer", torch_dtype=torch.float16
)

text_encoder_2/config.json: 100% [........] 1.27k/1.27k [00:00<00:00, 273kB/s]
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
low_cpu_mem_usage was None, now set to True since model is quantized.
model.safetensors: 100% [........] 6.33G/6.33G [06:16<00:00, 16.3MB/s]
transformer/config.json: 100% [........] 902/902 [00:00<00:00, 83.4kB/s]
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'diffusers.quantizers.quantization_config.BitsAndBytesConfig'>.
diffusion_pytorch_model.safetensors: 100% [........] 6.70G/6.70G [01:58<00:00, 38.7MB/s]
CPU times: user 26.1 s, sys: 30.6 s, total: 56.7 s
Wall time: 8min 20s

Just to check:

$ cd ~/.cache/huggingface/hub/models--sayakpaul--flux.1-dev-nf4-pkg
$ du -h
8.0K    ./snapshots/da15611221570dbf0ad4f620165f0072e443e3c4/transformer
8.0K    ./snapshots/da15611221570dbf0ad4f620165f0072e443e3c4/text_encoder_2
20K     ./snapshots/da15611221570dbf0ad4f620165f0072e443e3c4
24K     ./snapshots
8.0K    ./refs
4.0K    ./.no_exist/da15611221570dbf0ad4f620165f0072e443e3c4/transformer
8.0K    ./.no_exist/da15611221570dbf0ad4f620165f0072e443e3c4
12K     ./.no_exist
13G     ./blobs
13G     .

Continue with the following; since black-forest-labs/FLUX.1-dev is already downloaded, it finishes quickly.

%%time

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=text_encoder_2,
    transformer=transformer,
    torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

And now, finally, generation.

%%time

prompt = "A mystic cat with a sign that says hello world!"
image = pipe(
    prompt,
    guidance_scale=3.5,
    num_inference_steps=50,
    generator=torch.manual_seed(0)
).images[0]

torch.cuda.empty_cache()
memory = bytes_to_giga_bytes(torch.cuda.memory_allocated())
print(f"{memory=} GB.")

image.save("flux-nf4-dev.png")

VRAM is pretty much maxed out...

$ nvidia-smi
...
| 41%   61C    P2   138W / 140W |  15098MiB / 16376MiB |    100%      Default |
...

And in the end:

OutOfMemoryError: CUDA out of memory.

Wondering whether Colab would behave differently, I took the hint from the error message and ran

$ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
$ jupyter lab

and that did it. Still tight, but there seems to be a bit more headroom. For GPUs with exactly 16 GB of VRAM like the T4 / A4000, launching this way looks like the right move; with around 24 GB like the L4, either should be fine.

$ nvidia-smi
...
| 59%   80C    P2   139W / 140W |  14566MiB / 16376MiB |    100%      Default |

100% [........] 50/50 [01:38<00:00,  2.01s/it]
The module 'T5EncoderModel' has been loaded in bitsandbytes 4bit and moving it to cpu via .to() is not supported. Module is still on cuda:0. In most cases, it is recommended to not change the device.
The module 'FluxTransformer2DModel' has been loaded in bitsandbytes 4bit and moving it to cpu via .to() is not supported. Module is still on cuda:0. In most cases, it is recommended to not change the device.
memory='12.141' GB.
CPU times: user 1min, sys: 39.9 s, total: 1min 40s
Wall time: 1min 40s

Reasonably fast...? Around 1min 40s on the A4000 and around 1min 57s on the L4.

In the end, storage ended up like this:

$ df -T
...
overlay        overlay 103019448 70426584  32576480  69% /