👋

元画像から特徴を抽出するGPTsを作って画像を生成してみたら意外とよかった

2024/02/17に公開

はじめに

AIを利用した画像生成機能は、ChatGPTだけにとどまらず、様々なサービスで提供されています。
私自身、様々な画像生成サービスを試してみたのですが、プロンプトがうまく設計できないと、思い描いた通りの画像を得ることが難しく、もどかしさを感じていました。
ChatGPTに画像を添付して「似た画像を生成して」と依頼すると、基本的な生成は可能ですが、後の微調整が難しく、頻繁に試すと使用制限に達し、サービスが利用できなくなることがありました。
微調整するために何度も生成を試すと、すぐに使用制限に達してしまい、サービスを利用できなくなることもありました。
最適な画像を生成するための指示の与え方を考える過程で、ChatGPTに指示を書き出させることが解決策になると気付きました。

画像生成の試行錯誤を減らすためにGPTsを作成してみたところ、割といい感じなものができましたので、せっかくなら公開しようと思いました。
以下で紹介するように、Geminiは無料で画像生成ができるため、Geminiで使いやすくなるように若干調整しています。

Geminiについて

Googleのテキスト生成AIとしてBardというものがありました。
Bardが進化したAIがGeminiです。

簡単ではありますが、Geminiの良い点についてピックアップします。

読みやすい生成テキスト：見出しや箇条書きを活用していて読みやすい。
生成スピードが速い：GPT4は生成スピードが遅いですが、半分以下の時間で生成されます。
回答候補が3つ生成：3つも出力されているのにGPT4より生成が早いのが驚異的。
ファクトチェックが簡単：ボタン1つで確からしいものとそうでないものを可視化。
Google系サービスと連動：地図（Google Map）や動画（YouTube）などと連動した出力。
マルチモーダル：テキスト生成だけでなく、画像を読み取る、画像を生成するなど可能
無料：これだけ出来て無料！さすがGoogle。Mapのようにいつか有料化されそうで怖い

生成結果の性能がかなり向上していることと、何より無料で何度も使えるのが素敵すぎるので、私自身もGeminiを利用する機会が増えています。
ちなみに、Geminiの画像生成は日本語対応しておらず、英語でメッセージを送った場合のみ生成されるようになっています。
「create image」と始めて、後ろに生成したい画像のイメージを伝えると画像が生成されます。

https://gemini.google.com/app

作成したGPTsの紹介

こちらが作成したGPTsです。

生成例

Image Prompt Craftを用いて生成した画像の例をいくつか紹介します。
結構特徴をとらえた画像ができていると思います。

例1：料理

元画像

ChatGTPで生成された画像

Geminiで生成された画像

出力結果のプロンプト（クリックすると開きます）

create image
[Food Composition]

Sliced smoked salmon
Lemon slices
Orange slices
Red onion rings
Baby spinach leaves
Arugula leaves
Plate rim pattern detail
White plate
Shadow on the plate
Overall color palette

[Color and Texture]

Pink and white hues of salmon
Translucency of red onion
Glossy texture of citrus fruit
Velvety texture of spinach leaves
Green and red colors of arugula
Smooth white ceramic plate surface
Soft shadow indicating light source

[Arrangement and Presentation]

Overlapping arrangement of salmon and citrus
Central pile of greens
Scattered red onion for visual contrast
Deliberate placement for an appetizing display
Edible garnish as a focal point
Balance between ingredients in terms of color and shape

[Ambiance and Setting]

High-key lighting
Clean and simple background
Fine dining presentation style
Minimalist approach to food plating
Focus on the freshness of ingredients

[Ingredients and Portions]

Proportion of salmon to citrus
Ratio of greens to other components
Quantity of red onion used
Number of lemon slices
Number of orange slices
Even distribution of ingredients
Generous portion of baby spinach

{
  "prompt": "A gourmet dish on a white ceramic plate, consisting of neatly arranged smoked salmon slices, lemon and orange slices, red onion rings, fresh baby spinach, and arugula leaves. The salmon has a pinkish-white hue, the citrus fruits have a shiny texture, and the onions are slightly translucent. The green spinach and red-tinged arugula add a velvety texture. There's a delicate shadow on the plate, suggesting a light source from the side, emphasizing the fine dining presentation and minimalist food plating style with a clean background.",
  "size": "1024x1024"
}

[食材の構成]

スモークサーモンのスライス
レモンのスライス
オレンジのスライス
赤タマネギのリング
ベビースピナッチの葉
ルッコラの葉
皿の縁の模様の詳細
白い皿
皿の影
全体の色合い

[色と質感]

サーモンのピンクと白の色合い
赤タマネギの透明感
柑橘類の艶やかな質感
スピナッチの葉のビロードのような質感
ルッコラの緑と赤の色
滑らかな白いセラミック皿の表面
光源を示す柔らかな影

[配置とプレゼンテーション]

サーモンと柑橘類の重なり合う配置
中央に積まれた緑の葉
視覚的なコントラストのために散りばめられた赤タマネギ
食欲をそそるディスプレイのための意図的な配置
フォーカルポイントとしての食用ガーニッシュ
色と形における成分間のバランス

[雰囲気と設定]

ハイキー照明
シンプルでクリーンな背景
ファインダイニングスタイルのプレゼンテーション
食器盛り付けのミニマリストアプローチ
成分の新鮮さに焦点を当てる

[成分と分量]

サーモンと柑橘類の割合
他の成分に対する緑の部分の比率
使用される赤タマネギの量
レモンスライスの数
オレンジスライスの数
成分の均等な分布
ベビースピナッチの豊富な分量

例2：橋と船

元画像

ChatGTPで生成された画像

Geminiで生成された画像

出力結果のプロンプト（クリックすると開きます）

create image
[Structures]

Suspension bridge
Tall ship with three masts
Steel bridge construction
Ship's multiple sails

[Environment]

Clear blue sky
Calm sea
Distant city skyline

[Colors and Textures]

White bridge towers
Blue and white bridge underbelly
Red hull of the ship
Brown masts and yards
White sails

[Details]

Bridge cables
Windows on the ship
Ship's bow wave
Ship's smokestack

[Scale and Perspective]

Large scale of the bridge
Comparison of ship size to bridge
Perspective of bridge height over ship
Viewpoint from below the bridge

[Light and Shadow]

Sunlight on the bridge and ship
Shadows under the bridge
Shadow of the ship on water
Brightness of the sky

[Activity]

Movement of the ship
Static nature of the bridge

[Design and Style]

Industrial design of the bridge
Traditional design of the ship
Contrast between modern and classic engineering

[Ambiance]

Peaceful sea setting
Harmony between man-made structures and nature

[Time]

Daytime scene
Weather conditions suggesting early afternoon

{
  "prompt": "A majestic tall ship with three masts and white sails is sailing under a massive suspension bridge with white towers and a blue and white underbelly. The ship's red hull contrasts with the blue calm sea and clear sky. The bridge's steel construction features complex cable patterns. In the background, a city skyline can be seen under the bright daylight, illustrating a peaceful scene where modern engineering and traditional sailing coexist harmoniously.",
  "size": "1024x1024"
}

[構造]

吊り橋
三本マストを持つ大型帆船
鋼鉄の橋梁構造
船の複数の帆

[環境]

晴れた青空
穏やかな海
遠くの都市のスカイライン

[色と質感]

橋の塔の白色
橋の下腹部の青と白
船の赤い船体
船のマストとヤードの茶色
船の帆の白色

[細部]

橋のケーブル
船の窓
船の舳波
船の煙突

[規模と遠近法]

橋の大規模なスケール
橋に比べた船のサイズ
船の上にかかる橋の高さの遠近法
橋の下からの視点

[光と影]

橋と船に当たる日光
橋の下の影
水面に映る船の影
空の明るさ

[活動]

船の動き
橋の静的な性質

[デザインとスタイル]

橋の産業デザイン
船の伝統的なデザイン
現代とクラシックなエンジニアリングの対比

[雰囲気]

平和な海の設定
人工構造物と自然の調和

[時間]

昼間のシーン
初午後を示唆する天気条件

例3：昔、X（旧Twitter）で使っていた私の写真

元画像

ChatGPTで生成された画像

Geminiで生成された画像

出力結果のプロンプト（クリックすると開きます）

create image
[Human Features]
- Male individual
- Eyes closed
- Dark, medium-length hair
- Facial stubble
- Slightly furrowed brow

[Clothing]
- Blue patterned shirt

[Accessories]
- Drinking straw, presumably from a beverage

[Posture]
- Head resting on hand
- Elbow on surface, likely a table

[Emotion]
- Expression suggests tiredness or deep thought

[Environment]
- Indoor setting
- Dim lighting
- Textured wall background

[Colors]
- Predominant blue tones from the shirt
- Neutral and dark tones in the background

Now, let's generate an image based on these features.

{
  "prompt": "A male individual with eyes closed, dark medium-length hair, and facial stubble. He is wearing a blue patterned shirt. The individual has a slightly furrowed brow, suggesting tiredness or deep thought. He is indoors, with dim lighting and a textured wall in the background. His head is resting on his hand, and his elbow is on a surface that seems like a table. A drinking straw is visible, indicating the presence of a beverage.",
  "size": "1024x1024"
}

Now, the same features in the original language (Japanese):

[人物の特徴]
- 男性
- 目を閉じている
- 暗い、中長髪
- 顔の無精ひげ
- 少し眉をひそめている

[服装]
- 青い模様のシャツ

[アクセサリー]
- 飲み物と思われるストロー

[姿勢]
- 手を頭に置き、肘を面に置いている

[感情]
- 疲れまたは深い思索を示唆する表情

[環境]
- 屋内設定
- 薄暗い照明
- 質感のある壁の背景

[色]
- シャツからの青色のトーンが支配的
- 背景のニュートラルで暗いトーン

Image Prompt Craftの内部構造の解説

Image Prompt Craftの目的は画像生成そのものではありません。
画像の特徴を書き出すことで、以下のようなメリットを期待しています。

画像生成の指示を与えるプロンプトのテンプレート化
特徴を加筆修正することで微調整が容易に

以下にImage Prompt Craftの内部がどうなっているかを解説していきます。
まず、Image Prompt Craftがどのような出力を行うのかを簡単に説明します。

最初にユーザが画像を添付してメッセージを送信します。
添付された画像を元に以下の構成で応答します。

添付された画像から画像が持つ特徴を抽出した内容を英語で書き出します。
書き出された特徴を元に画像を生成
英語で書き出した内容を日本語で書き出します。

このような出力をどうやって実現しているのかを実際のプロンプトを紹介します。

Instructions全体

GPTsの応答内容を決めるプロンプトの事を、Instructionsといいます。
Instructionsは「指示、命令、説明」というような意味合いです。

Image Prompt CraftのInstructionsは以下のようになっています。

The GPT's primary role is to analyze images and summarize their features in a structured, categorized list. The categories will be enclosed in brackets. Each list should contain at least 30 features, with a maximum of around 100 features for complex images. After summarizing in the user's locale language, the GPT should provide the same content in English. This task includes identifying visual elements, colors, themes, and any other notable characteristics present in the image.

## 出力手順
1. 特徴を100個抽出
2. 抽出する際には、次の情報を重点的に抽出。特徴を表す表現、特徴が画像に占める割合、特徴を最上位に抽象化した表現（カテゴリと呼ぶ）
  2-1. 個別調整として、キャラクタやコスチュームに関する特徴は他カテゴリより割合を1.5倍に重み付け
3. 抽出した特徴を、割合の大きい順にランク付け（割合ランク）
4. 抽出したカテゴリに対して、同じカテゴリを持つ特徴が多い順にランク付け（カテゴリランク）
5. 割合ランク、カテゴリランクを総合したランク付けをする。
6. 抽出した特徴をカテゴリ単位にまとめて、英語で書き出す
7. 書き出したあと抽出した特徴で画像を生成
8. 最後にロケールの言語で特徴を書き出す。

## 特徴抽出のポリシー
- 画像の特徴が強いものから抽出
- 特徴が画像内に存在する個数を書き出す
- 特徴の大きさ（割合）を書き出す

## 出力形式
- Markdown style
- 下記に示す出力例以外の情報は書き出さなくて良い。

1. 英語の特徴出力例
'```
create image
[カテゴリ]
- 特徴
- 特徴
'```

2. 画像生成
1で書き出したプロンプトで生成した画像を表示

3. ロケール言語で特徴出力例
'```
[カテゴリ]
- 特徴
- 特徴
'```

Instructionのポイント解説

前半の英語部分はGPTsを作成した際の最初に「画像の特徴を抽出したい」旨の要望を出した際に自動で生成された部分です。
そこから下は直接書き加えている部分となっています。

プロンプト全体のポイント

Markdown記法のように見出しを入れる。
やってほしいことを箇条書きにする。
出力が思わしくない場合は、やりたいことを分割して、順番を書いて指示する。
出力形式を指示する。

画像生成のポイント

画像の中の特徴点を抽出する
特徴を抽象化した表現にすることでカテゴリとする
抽出した特徴が画像の中に占める割合が大きいものをピックアップ

また、Gemini用のプロンプトにするため、英語で出力しています。
英語出力の場合に、出だしに「create image」を含めるようにすることで、そのままコピーして使えるようにしています。

まだ、調整しきれていない点として以下のような点があります。

画像が生成されないことがある。
コード表記したり、テキスト表記したり出力形式が安定しない

まとめ

私の作成した画像生成用プロンプトのGPTs「Image Prompt Craft」を紹介させていただきました。
Image Prompt CraftのInstructionsを基に、GPTsの指示をどのように構成し、応答内容を調整しているか解説しました。

GPTsを利用可能な方は是非使ってみてください。
GPTsの作成に興味がある方は、このInstructionsを参考にしてみてください。

Discussion

ログインするとコメントできます