🤖

Generating Font Images with Stable Diffusion

Published 2022/09/03

font

Stable Diffusion

tech

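Preface

Stable Diffusion is all the rage lately... or perhaps I should say image-generation AI in general. From what I have seen, most people use it to generate illustrations. And generating illustrations is great: it's fun, and the results can be used as assets.
But... that's that.
This time I figured there ought to be a use that points in a different direction from illustration. So what would that be, concretely? Something that is an "image" that gets "generated", yet isn't just a plain "asset"... There are plenty of possible answers, but mine was this: fonts.
So in this article, I'll have Stable Diffusion generate font images covering digits and the alphabet.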

Environment

Specs

OS: Windows 10 Home
Memory: 16GB
CPU: AMD Ryzen 5 3600
GPU: NVIDIA GeForce RTX 3060

Stable Diffusion setup

Built with Docker and Optimized Stable Diffusion, following this article
Took image_to_image.py from the Diffusers repository and placed it in the working directory

Python packages

huggingface-hub v0.9.1
numpy v1.3.2
opencv-python v4.6.0.66
torch v1.12.1

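Trial and error

First, the basic idea goes like this:

Stable Diffusion has a feature called img2img that generates an image from an image!
If I feed it glyph images from a highly legible font and put something like ELDEN RING style in the prompt, won't it spit out something nice!?
Let's just try it!

So let's just try it.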

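Feeding it an image with everything crammed in

I prepared an image like this.

It lays out the 0-9 and A-Z glyphs of a font called Lexend Medium. I picked Lexend for no deeper reason than that it looked easy to read. Let's hand this image to Stable Diffusion and see what it spits out.
Here is the prompt:

sample of a beautiful font, in Elden Ring style, designed with Inkscape

I set Strength to about 0.8 for now and... generate! Here's the result!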

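So which part am I supposed to read vertically?
As you can see, Stable Diffusion reads an alphabet chart with no more insight than "there are some letters lined up here". It imitates letters, not the alphabet, so without countermeasures you get word salad like this.
Incidentally, setting Strength to around 0.5 gives

perfectly readable characters like these, but they are so readable that I don't like them. I want something more twisted and curvy.
Also, on closer inspection, there are places where a 5 has quietly turned into a 6...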
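
To get a feel for why 0.8 and 0.5 behave so differently, here is a minimal sketch of how Strength relates to the amount of denoising. It assumes the img2img pipeline runs roughly `int(num_inference_steps * strength)` of the scheduled steps; that rule is my understanding of the pipeline, not something verified in this article, and `approx_denoising_steps` is a hypothetical helper name.

```python
def approx_denoising_steps(num_inference_steps: int, strength: float) -> int:
    """Rough number of denoising steps an img2img run performs.

    Assumption: the pipeline keeps about `steps * strength` of the schedule,
    so a lower strength leaves more of the input image intact.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


# With 20 scheduled steps (the value used later in the full script):
for strength in (0.5, 0.8):
    print(strength, approx_denoising_steps(20, strength))
```

Under that assumption, Strength 0.5 reworks the glyph sheet for only about half of the schedule, which would fit the "too readable" result above.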

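Conversely, feeding it an image with only one kind of character

If "there are some letters lined up" is the problem, how about not lining up different letters at all? In other words,

something like this.
Putting myself in Stable Diffusion's shoes^[1], the previous image was one with lots of A's, B's, C's and so on, whereas this one is simply an image with lots of A's. However loosely it is interpreted, an A stays an A, and it should rarely get replaced by a completely different letter^[2].
Let's actually run img2img on this image with the same prompt as before.

Apart from the junk in the background, it nicely outputs nothing but A's.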
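
The layout of such an input image can be sketched as follows: the same glyph is drawn once in each corner of a square canvas. The 512px canvas and 30px margin are the values used in the full script later in this article; `corner_origins` itself is a hypothetical helper, and the overlap check is my own addition.

```python
def corner_origins(glyph_w, glyph_h, canvas=512, margin=30):
    """Top-left drawing positions for four copies of one glyph, one per corner.

    Order: top-left, top-right, bottom-left, bottom-right.
    """
    left, right = margin, canvas - margin - glyph_w
    top, bottom = margin, canvas - margin - glyph_h
    if right < left + glyph_w or bottom < top + glyph_h:
        raise ValueError("glyph too large: the copies would overlap")
    return [(left, top), (right, top), (left, bottom), (right, bottom)]


# A glyph that is 150px wide and 180px tall:
print(corner_origins(150, 180))
```

Keeping the four copies well apart leaves an empty cross in the middle of the canvas, which the post-processing below relies on.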

Post-processing the image

Adding things like no background or white background to the prompt doesn't remove the junk in the background, so I remove it with OpenCV image processing.


def extract_letters(img,bounds,blocks):
    # Binarize: treat every pixel brighter than 64 as white, then threshold.
    gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
    gray[gray > 64] = 255
    _,mono = cv2.threshold(gray,127,255,cv2.THRESH_BINARY)
    # Close small holes, then remove small specks.
    mono = cv2.morphologyEx(mono, cv2.MORPH_CLOSE, np.ones((2,2),np.uint8))
    mono = cv2.morphologyEx(mono, cv2.MORPH_OPEN, np.ones((3,3),np.uint8))
    contours,_ = cv2.findContours(cv2.bitwise_not(mono), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    height,width = img.shape[:2]

    # Erase every contour that straddles the vertical or horizontal center line.
    for cnt in contours:
        x,y,w,h = cv2.boundingRect(cnt)
        if (x < width/2 <= x+w-1) or (y < height/2 <= y+h-1):
            cv2.drawContours(mono,[cnt],0,255,-1)
    # Paint a white cross over the center, where no glyph can be.
    left,right = bounds[0][0],bounds[1][0]
    top,bottom = bounds[0][1],bounds[2][1]
    cv2.rectangle(mono,(left+(right-left)//4,0),(left+3*(right-left)//4,height),255,-1)
    cv2.rectangle(mono,(0,top+(bottom-top)//4),(width,top+3*(bottom-top)//4),255,-1)
    nlabels, labels, stats, _ = cv2.connectedComponentsWithStats(cv2.bitwise_not(mono))
    areas = stats[:,cv2.CC_STAT_AREA]
    sorted_labels = sorted(range(1,nlabels),key=lambda n:areas[n],reverse=True)
    out = np.full(img.shape,255,np.uint8)
    # Always keep the `blocks` largest components, then keep going while the
    # area exceeds 10% of the second-largest component.
    for i,label in enumerate(sorted_labels):
        if blocks > i or areas[label] > areas[sorted_labels[1]]*0.1:
            out[labels==label] = [0,0,0]
        else:
            break
    return out

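What this code does is, in short:

Convert the image to grayscale
Remove pixels that are reasonably bright
Extract contours and remove any that straddle the dividing lines when the image is split into four
Paint a white cross over the center, where no character can possibly be
Create a blank white image
Get the connected pixel blobs and draw them in descending order of area until the area falls below a threshold

With this, for example, this image

becomes this.

Not bad, is it?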
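
The last step, choosing which blobs to keep, can be sketched on its own without OpenCV. This is a minimal illustration that works on a plain list of blob areas; `select_components` is a hypothetical name, and the 10% ratio against the second-largest blob mirrors the threshold used in the code above.

```python
def select_components(areas, blocks, ratio=0.1):
    """Return indices (into `areas`) of the blobs to keep, largest first.

    The `blocks` largest blobs are always kept (they are the expected glyph
    parts). After that, blobs are kept only while their area exceeds
    `ratio` times the area of the second-largest blob.
    """
    order = sorted(range(len(areas)), key=lambda i: areas[i], reverse=True)
    kept = []
    for rank, idx in enumerate(order):
        if rank < blocks:
            kept.append(idx)
        elif len(order) > 1 and areas[idx] > areas[order[1]] * ratio:
            kept.append(idx)
        else:
            break  # everything after this is smaller, so stop here
    return kept


# Four glyph-sized blobs plus some debris of decreasing size.
areas = [40, 500, 5, 480, 470, 60, 460]
print(select_components(areas, blocks=4))
```

Here the blob of area 60 survives because it is larger than 10% of 480, while the blobs of area 40 and 5 are dropped as debris.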

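B still turns into D

The problem of B turning into D is still stubborn. For example, like this, it slips in as if it were the most natural thing in the world.

I wish it would stop looking at me like "I'm a genuine B, you know?"
I thought about various countermeasures, but the problem boils down to "one (or two) of the characters differs from the rest". With Stable Diffusion you can reroll the gacha as often as you like, so why not keep rolling until the odd one out is gone?
So let's implement that. First, this function cuts out each of the four characters and returns them as a list of image arrays: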


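def trim_letters(img):
    # Split the sheet into quadrants: top-left, top-right, bottom-left, bottom-right.
    parts = [q for half in np.vsplit(img,2) for q in np.hsplit(half,2)]
    out = []
    for p in parts:
        # Rows and columns that contain at least one black pixel.
        rows = np.where(np.any(p==0,axis=1))[0]
        cols = np.where(np.any(p==0,axis=0))[0]
        # +1 so that the last inked row and column are included in the crop.
        r0,r1 = (rows[0],rows[-1]+1) if len(rows)>=1 else (0,p.shape[0])
        c0,c1 = (cols[0],cols[-1]+1) if len(cols)>=1 else (0,p.shape[1])
        out.append(p[r0:r1,c0:c1])
    return out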
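
The cropping idea is easier to see on a tiny array. The following is a minimal sketch, assuming a 2-D grayscale tile in which ink is exactly 0; `crop_to_ink` is a hypothetical helper, not part of the script.

```python
import numpy as np


def crop_to_ink(img):
    """Crop a 2-D grayscale array to the bounding box of its black (0) pixels."""
    ink = img == 0
    rows = np.flatnonzero(ink.any(axis=1))  # rows containing any ink
    cols = np.flatnonzero(ink.any(axis=0))  # columns containing any ink
    if rows.size == 0:
        return img  # no ink at all: return the tile unchanged
    # The +1 makes the last inked row and column part of the crop.
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


tile = np.full((6, 8), 255, np.uint8)
tile[2:5, 3:6] = 0  # a 3x3 black square
print(crop_to_ink(tile).shape)  # (3, 3)
```

The cropped width and height are what the size comparison in the next step works with.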

And this function computes how similar the characters are from those image arrays and returns whether the result can be accepted as is:


def match_letters(parts):
    height_list = [p.shape[0] for p in parts]
    width_list = [p.shape[1] for p in parts]
    # Reference glyph: the one with the largest width + height.
    maximum_image = int(np.argmax([w+h for w,h in zip(width_list,height_list)]))
    # Reject when the glyph sizes differ too much from each other.
    if (max(width_list)-min(width_list))+(max(height_list)-min(height_list))>30:
        return False
    # Compare every other glyph's histogram against the reference glyph.
    maximum_hist = cv2.calcHist([parts[maximum_image]],[0],None,[256],[0,256])
    for i,p in enumerate(parts):
        if i != maximum_image:
            hist = cv2.calcHist([p],[0],None,[256],[0,256])
            if cv2.compareHist(maximum_hist,hist,cv2.HISTCMP_BHATTACHARYYA) > 0.04:
                return False
    return True

Similarity is computed from intensity histograms, using OpenCV's compareHist with method 3 (the Bhattacharyya distance).
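
To make the comparison concrete, here is a NumPy-only sketch of the Bhattacharyya distance between two histograms. As far as I can tell it corresponds to the formula OpenCV documents for this method once both histograms are normalized, but treat that equivalence as an assumption; `ink_histogram` and `bhattacharyya_distance` are hypothetical helper names.

```python
import numpy as np


def ink_histogram(img):
    """256-bin intensity histogram of a uint8 image."""
    return np.bincount(np.asarray(img, np.uint8).ravel(), minlength=256)


def bhattacharyya_distance(hist_a, hist_b):
    """Distance in [0, 1]: 0 for identical distributions, 1 for disjoint ones."""
    a = np.asarray(hist_a, dtype=np.float64).ravel()
    b = np.asarray(hist_b, dtype=np.float64).ravel()
    if a.sum() == 0 or b.sum() == 0:
        raise ValueError("histograms must not be empty")
    p, q = a / a.sum(), b / b.sum()
    coefficient = np.sqrt(p * q).sum()
    return float(np.sqrt(max(0.0, 1.0 - coefficient)))


# Two 4x4 binary tiles with the same amount of ink but different shapes.
left_half = np.full((4, 4), 255, np.uint8)
left_half[:, :2] = 0
top_half = np.full((4, 4), 255, np.uint8)
top_half[:2, :] = 0
print(bhattacharyya_distance(ink_histogram(left_half), ink_histogram(top_half)))
```

One consequence worth keeping in mind: on black-and-white glyphs the histogram mostly reflects the ratio of ink to background, so two differently shaped glyphs with similar ink coverage come out as near-identical. That may be part of why similarly sized letters can still slip through.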

Generating

Putting all of this together, here is the full code.


from torch import autocast
import torch
from image_to_image import StableDiffusionImg2ImgPipeline, preprocess
from PIL import Image, ImageDraw, ImageFont
import cv2
import numpy as np

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True
).to("cuda")

def generate_original_image(characters,base_font):
    out = []
    font = ImageFont.truetype(base_font,size=192)
    for target in characters:
        img = Image.new('RGB', (512, 512),(255,255,255))
        draw = ImageDraw.Draw(img)
        bounds = []
        # textbbox returns (left, top, right, bottom), so w and h here are
        # the right and bottom edges of the glyph drawn at the origin.
        x,y,w,h = draw.textbbox((0,0),target,font=font)
        # Draw the same glyph in each corner with a 30px margin.
        for count in range(4):
            dx = 30-x if count%2==0 else 512-30-w
            dy = 30-y if count<=1 else 512-30-h
            draw.text((dx,dy),target,(0,0,0),font=font)
            bounds.append((
                30+w if count%2==0 else 512-30-w+x,
                30+h if count<=1 else 512-30-h+y
            ))
        # Count the connected components of the ink (label 0 is the background).
        cv2_img = np.asarray(img)
        gray = cv2.cvtColor(cv2_img,cv2.COLOR_RGB2GRAY)
        _,th = cv2.threshold(gray,127,255,cv2.THRESH_BINARY)
        nlabels, _ = cv2.connectedComponents(cv2.bitwise_not(th))
        out.append([img,target,nlabels-1,bounds,h])
    return out


def extract_letters(img,bounds,blocks):
    # Binarize: treat every pixel brighter than 64 as white, then threshold.
    gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
    gray[gray > 64] = 255
    _,mono = cv2.threshold(gray,127,255,cv2.THRESH_BINARY)
    # Close small holes, then remove small specks.
    mono = cv2.morphologyEx(mono, cv2.MORPH_CLOSE, np.ones((2,2),np.uint8))
    mono = cv2.morphologyEx(mono, cv2.MORPH_OPEN, np.ones((3,3),np.uint8))
    contours,_ = cv2.findContours(cv2.bitwise_not(mono), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    height,width = img.shape[:2]

    # Erase every contour that straddles the vertical or horizontal center line.
    for cnt in contours:
        x,y,w,h = cv2.boundingRect(cnt)
        if (x < width/2 <= x+w-1) or (y < height/2 <= y+h-1):
            cv2.drawContours(mono,[cnt],0,255,-1)
    # Paint a white cross over the center, where no glyph can be.
    left,right = bounds[0][0],bounds[1][0]
    top,bottom = bounds[0][1],bounds[2][1]
    cv2.rectangle(mono,(left+(right-left)//4,0),(left+3*(right-left)//4,height),255,-1)
    cv2.rectangle(mono,(0,top+(bottom-top)//4),(width,top+3*(bottom-top)//4),255,-1)
    nlabels, labels, stats, _ = cv2.connectedComponentsWithStats(cv2.bitwise_not(mono))
    areas = stats[:,cv2.CC_STAT_AREA]
    sorted_labels = sorted(range(1,nlabels),key=lambda n:areas[n],reverse=True)
    out = np.full(img.shape,255,np.uint8)
    # Always keep the `blocks` largest components, then keep going while the
    # area exceeds 10% of the second-largest component.
    for i,label in enumerate(sorted_labels):
        if blocks > i or areas[label] > areas[sorted_labels[1]]*0.1:
            out[labels==label] = [0,0,0]
        else:
            break
    return out

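def trim_letters(img):
    # Split the sheet into quadrants: top-left, top-right, bottom-left, bottom-right.
    parts = [q for half in np.vsplit(img,2) for q in np.hsplit(half,2)]
    out = []
    for p in parts:
        # Rows and columns that contain at least one black pixel.
        rows = np.where(np.any(p==0,axis=1))[0]
        cols = np.where(np.any(p==0,axis=0))[0]
        # +1 so that the last inked row and column are included in the crop.
        r0,r1 = (rows[0],rows[-1]+1) if len(rows)>=1 else (0,p.shape[0])
        c0,c1 = (cols[0],cols[-1]+1) if len(cols)>=1 else (0,p.shape[1])
        out.append(p[r0:r1,c0:c1])
    return out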



def match_letters(parts):
    height_list = [p.shape[0] for p in parts]
    width_list = [p.shape[1] for p in parts]
    # Reference glyph: the one with the largest width + height.
    maximum_image = int(np.argmax([w+h for w,h in zip(width_list,height_list)]))
    # Reject when the glyph sizes differ too much from each other.
    if (max(width_list)-min(width_list))+(max(height_list)-min(height_list))>30:
        return False
    # Compare every other glyph's histogram against the reference glyph.
    maximum_hist = cv2.calcHist([parts[maximum_image]],[0],None,[256],[0,256])
    for i,p in enumerate(parts):
        if i != maximum_image:
            hist = cv2.calcHist([p],[0],None,[256],[0,256])
            if cv2.compareHist(maximum_hist,hist,cv2.HISTCMP_BHATTACHARYYA) > 0.04:
                return False
    return True

def generate(
    seed=100,
    characters="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz",
    base_font="./LexendDeca-Medium.ttf",
    base_contents="Elden Ring",
    prompt="sample of an useful vector bold font, in {base} style, designed with Inkscape, monochrome, no background",
    max_strength=0.9
):

    with autocast("cuda"):
        generator = torch.Generator("cuda")
        generator.manual_seed(seed)
        results = {}
        PROMPT = prompt.format(base=base_contents)
        for original_img,char,blocks,bounds,top_margin in generate_original_image(characters,base_font):
            print(char)
            init_img = preprocess(original_img)
            strength = max_strength
            while True:
                generated_image = (pipe(PROMPT,num_inference_steps=20,init_image=init_img,strength=strength,generator=generator)["sample"][0])
                extracted_image = extract_letters(np.asarray(generated_image),bounds,blocks)
                trimmed_images = trim_letters(extracted_image)

                # The glyph with the largest width + height becomes the result.
                sizes = [p.shape[0]+p.shape[1] for p in trimmed_images]
                maximum_image = int(np.argmax(sizes))
                # Accept as soon as all four glyphs agree with each other.
                if match_letters(trimmed_images):
                    results[char] = trimmed_images[maximum_image]
                    break
                # Otherwise lower the strength and reroll; give up below 0.5.
                strength -= 0.01
                if strength < 0.5:
                    results[char] = trimmed_images[maximum_image]
                    break
        # Write one PNG per character, prefixed with a zero-padded index.
        for i,(char,image) in enumerate(results.items()):
            cv2.imwrite(f"./outputs/{str(i).zfill(len(str(len(characters))))}_{char}.png",image)

generate()

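Running generate() writes the files under ./outputs (create the directory beforehand, since cv2.imwrite does not create it).
Changing the seed argument changes the result, and base_contents takes the work or style to base the font on.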
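
The reroll logic inside generate() can be looked at in isolation. Below is a minimal sketch with the image generation and the acceptance check passed in as callables, so it runs without a GPU. `reroll_until_accepted`, `generate_fn`, and `accept_fn` are hypothetical names. The sketch counts steps with an integer instead of repeatedly subtracting 0.01, to sidestep floating-point drift, and it assumes the final attempt at exactly 0.50 is included.

```python
def reroll_until_accepted(generate_fn, accept_fn,
                          max_strength=0.9, min_strength=0.5, step=0.01):
    """Call generate_fn(strength) with decreasing strength until accept_fn passes.

    Returns (candidate, strength, accepted). If nothing is accepted, the last
    candidate is returned with accepted=False, like the fallback in generate().
    """
    if step <= 0 or max_strength < min_strength:
        raise ValueError("need step > 0 and max_strength >= min_strength")
    n_steps = round((max_strength - min_strength) / step)
    candidate, strength = None, max_strength
    for k in range(n_steps + 1):
        strength = round(max_strength - k * step, 10)
        candidate = generate_fn(strength)
        if accept_fn(candidate):
            return candidate, strength, True
    return candidate, strength, False


# Stand-in "generator" that just echoes the strength it was called with,
# and a stand-in check that only accepts results at strength 0.88 or lower.
result = reroll_until_accepted(lambda s: s, lambda c: c <= 0.88)
print(result)
```

With the defaults this makes at most 41 attempts per character (0.90 down to 0.50), which gives a rough upper bound on generation time per glyph.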

Output examples

The image files are available on GitHub.

`Elden Ring, Final Fantasy, Granblue Fantasy, Atelier Ryza, and Harry Potter`

`Star Wars`

`Hokusai`

They look pretty good overall!

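Thoughts

It generated better than I expected, so I'm happy
I'd like to convert the results to a ttf another time
Pairs such as y and v or O and Q still get mixed up sometimes, so I'd like a better way to deal with that

Footnotes

1. Does it even have such a thing?
2. It can happen with similarly shaped letters such as O and Q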
