DeepLearning.ai Large Multimodal Model Prompting with Geminiの学習記録

マルチモーダルモデル特有のpromptingを学ぶため、https://www.deeplearning.ai/short-courses/large-multimodal-model-prompting-with-gemini/ を受講する。

これは他のcourseとは異なりnotebookの実行環境は用意されていないが、Google Colabで実行する方法も用意されている。いろいろ試してみたいので、これを機にGoogle Cloudアカウントを作成して、Colabで触ってみる。

aijnek

 Introduction to Gemini ModelsGemini Family (上から下に、精度高→低, パフォーマンス低→高, コスト高→低)
Ultra: 高精度だが遅い
Pro: 精度とパフォーマンスのバランスがよい
Flash: リアルタイムな会話などパフォーマンスが大事なケース
Nano: distillationでつくられたedge device用モデル
Multimodal modelの特徴
inputはテキストだけでなく、画像, audio, video, PDF, codeなど
複数の種類がinterleaveされていてもよい
cross modal reasoningも可能 (例：画像に含まれる手書きの図や数式とtext promptに応答する)

aijnek

 multi modal prompting and parameter controlImageとTextのpromptの場合、Imageを先におくほうがよい結果が得られやすい
ImageやVideoを説明するCaptioningもできるが、Reasoningもできる（例：この画像にうつっている人物の職業は？ なぜこの動画が作成されたか推測して。）
parameters
Top k: the top k possible next words
Top p: the top p possible next words with a cumulative probability >= p
Temperature: randomness when picking next word from possible choices
temperatureが高くても、p=0.01と低くしたり、k=1とすることでrandomnessやcreativityは低く抑えられる。最適なパラメータ設定はscienceというよりartの領域。
max output tokenは、コスト抑制の目的で使うイメージだったが、回答の簡潔さのコントロールにも使える。
stop sequenceで指定したwordsの直前で回答が打ち切られる。ChatGPTがでてきた頃は余計な出力をはぶくために使った記憶があるが、いまどのような用途があるのかわからない。

aijnek

 Best Practices for Multimodal PromptingGemini 1.5 Pro image requirements
imageは3,000枚まで(!)
PNG or JPEG
pixel数制限はないが、大きいサイズのイメージはscale downとpaddingが行われる
ひとつのImageは258トークン
他のmultimodal modelは画像サイズによってtoken数が異なる。
modelによって振る舞いが異なるので1つのpromptがすべてのmodelで機能するとは限らない。
best practices:
be clear and concise
タスク指示書のように何をしてほしいか明確にする
モデルとcontextを共有できていると想定しない
専門的すぎる言葉も使わない

roleをassignする
単にdescribe the imageとするのではなく、You are and AI that does image understanding.のように役割を与えて何をすべきか明確にする
(system promptに書くのがよい）

structured promptを使う
image, roleなどに対してmarkdownなどでタグをつける
structureの例
role
objective: モデルに何を達成してほしいか
context: imageなどはここに該当
constraints: 長さ, 回答のformat、言語など


順番は大事
multimodalで入れ子にする場合は実験で適切な順番を見つけよう

aijnek

Creating Use Cases with Images

promptに含まれる空白は結果に影響を与えるので注意が必要。あまりケアしてなかった。

複数の画像とroleやinstructionを含むpromptの順番の例

receipt_content = [
    INSTRUCTION,
    ROLE,
    "Answer the questions based on the following receipts:"
    "breakfast:",
    receipt_images[0],
    "lunch:",
    receipt_images[1],
    "diner",
    receipt_images[2],
    "meal-others",
    receipt_images[3],
    ASSIGNMENT,
    policy,
]

ASSIGNMENTの例

ASSIGNMENT = """
You are reviewing travel expenses for a business trip.
Please complete the following tasks:
1. Itemize everything on the receipts, including tax and \
total.  This means identifying the cost of individual \
items that add up to the total cost before tax, as well \
as the tax ,such as sales tax, as well as tip.
2. What is the total sales tax paid?  In some cases, \
the total sales tax may be a sum of more than one line \
item of the receipt.
3. For this particular receipt, the employee who is \
adding this business expense purchased the meal with \
a group. The employee only ordered the KFC Bowl. Please \
provide the cost of the employee's order only.  Include \
both the cost before tax, and also estimate the tax \
that is applied to this employee's order.  To do this,\
calculate the fraction of the employee's pre-tax order\
divided by the total pre-tax cost.  This fraction can be \
applied to the total sales tax that you calculated earlier.
4.  Please calculate the amount spent by others, which \
are all the other line items on the receipt.  Please \
provide this sum before tax, and if possible, apply the \
tax for the total cost.
5. Check the expenses against company policy and flag \
if there are issues.
"""

gemini1.5-proを使ったからか、これだけの数のtaskでもちゃんと追従していた。1回のcallで1つのtaskに集中させたほうがqualityは上がるのではないかと思うが、今回のケースは1-4はstep by stepで実行することになるので、そのような場合は一度にさせたほうがよさそうだ。task数と精度のtradeoffは詳しく調べてみたい。
。

aijnek

Developing Use Cases with Videos

数分のvideoの内容をもとに下記の質問をすると, gemini1.5-flashでもちゃんと答えることができた。それぞれに該当するシーンは1秒ないくらいだったので驚き。

Answer the following questions using the video only.

Questions:
- What is the most searched sport?
- Who is the most searched scientist?
- What is the fastest record of solving rubic cube?

ただし、最後に

- When did pokemon TV program start?

というvideoに含まれない内容をいれたところ、この質問の回答はnot mentionedになったのはよいが、正解できていたほかのquestionに対してもnot mentionedとなってしまった。このあたりはmodelの弱さかもしれない。

aijnek

 Integrating Real-Time Data with Function CallingFunction Callingによってexternal APIやcustom functionにアクセスできるようにすることで、LLMアプリができることを大きく広げることできる。
LLMの役割は、APIやfunctionをどのような引数で呼ぶべきかを判断すること。アプリ側でAPIやfunctionをcallし、その結果をもとに再度LLMが処理を行う。
(特にmulti modal modelに限らない話だった）

aijnek

 感想multimodal modelに特有のpromptがたくさんあるわけではなく、role, context, instructionの順番に気を付けることと、できる限りclearに書くという通常のprompt best practiceがそのまま当てはまる、というものだった。ハルシネーションも出力の不安定さももちろんあるので、そのあたりの対処はtext modelと変わらない。
ただ、改めて複数種類の情報を使ってreasoningできるcapabilityは強力だと感じた。人間は視覚から多くの情報を得ているし、textにしっかりと意図を明確に書くのは大変。画像だけでなく動画もdetailまで見て判断することができているのは驚きだった。
Function callingはとても好きな機能。GeminiだけでなくOpenAIもClaudeも提供しているのでどんどん活用していきたい。

このスクラップは2024/08/31にクローズされました