Trying out Gemini 2.0 Flash

Official blog

https://blog.google/intl/ja-jp/company-news/technology/google-gemini-ai-update-december-2024/
A summary by umiyuki, with the key points listed concisely.

https://x.com/umiyuki_ai/status/1867070703499125120
Sample code already appears to be available.

https://x.com/kazunori_279/status/1866975767869231547
There seems to be a lot it can do, so I plan to try the features one at a time.

kun432

For now, these items from the official sample code above look like good starting points. They use Vertex AI on Google Cloud.
Basic usage of the Gen AI SDK with Gemini-2.0-Flash-Experimental
https://github.com/GoogleCloudPlatform/generative-ai/tree/main/gemini/getting-started/intro_gemini_2_0_flash.ipynb
A sample frontend/backend setup using the Multimodal Live API

https://github.com/GoogleCloudPlatform/generative-ai/tree/main/gemini/multimodal-live-api/websocket-demo-app
Real-time RAG using the Multimodal Live API

https://github.com/GoogleCloudPlatform/generative-ai/tree/main/gemini/multimodal-live-api/real_time_rag_retail_gemini_2_0.ipynb
A marketing use case combining multimodal input with Google Search

https://github.com/GoogleCloudPlatform/generative-ai/tree/main/gemini/use-cases/marketing/creating_marketing_assets_gemini_2_0.ipynb
Multi-agent system

https://github.com/GoogleCloudPlatform/generative-ai/tree/main/gemini/agents/research-multi-agents/intro_research_multi_agents_gemini_2_0.ipynb
The Gemini Cookbook also has samples. These use Google AI.
Basic usage of the Gen AI SDK with Gemini-2.0-Flash-Experimental

https://github.com/google-gemini/cookbook/tree/main/gemini-2/get_started.ipynb
Multimodal Live API basics

https://github.com/google-gemini/cookbook/tree/main/gemini-2/live_api_starter.ipynb
Using tools with the Multimodal Live API

https://github.com/google-gemini/cookbook/tree/main/gemini-2/live_api_tool_use.ipynb
This one also uses tools with the Multimodal Live API, but it appears to combine multiple tools to drive Google Maps

https://github.com/google-gemini/cookbook/tree/main/gemini-2/plotting_and_mapping.ipynb
Using Google Search as a tool

https://github.com/google-gemini/cookbook/tree/main/gemini-2/search_tool.ipynb
2D spatial understanding with Gemini-2.0-Flash-Experimental

https://github.com/google-gemini/cookbook/tree/main/gemini-2/spatial_understanding.ipynb
3D spatial understanding with Gemini-2.0-Flash-Experimental

https://github.com/google-gemini/cookbook/tree/main/gemini-2/spatial_understanding_3d.ipynb
Advanced reasoning with Gemini-2.0-Flash-Thinking-Experimental

https://github.com/google-gemini/cookbook/tree/main/gemini-2/thinking.ipynb
Video understanding

https://github.com/google-gemini/cookbook/tree/main/gemini-2/video_understanding.ipynb
I'll try them in no particular order.

kun432

Intro to Gemini 2.0 Flash

Running this in Colaboratory.

Install the Gen AI SDK. The same SDK works for both Google AI and Vertex AI.

!pip install --upgrade --quiet google-genai

On Colaboratory, you need to authenticate with Google Cloud.

from google.colab import auth

auth.authenticate_user()

Initialize the client with your Google Cloud project ID.

from google.colab import userdata
from google import genai

client = genai.Client(
    vertexai=True,
    project="XXXXXXXXXX",  # Google Cloud project ID
    location="us-central1",     # "asia-northeast1" does not seem to be available yet
)

Text generation

generate_content() generates a reply to the prompt and returns it in the response's .text property. The output is apparently Markdown by default.

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="太陽系で最大の惑星を教えて。"
)

print(response.text)

Output

太陽系で最大の惑星は木星です。木星は、他のすべての惑星を合わせた質量よりも質量が大きく、非常に大きな惑星です。

Streaming text generation

generate_content_stream() streams the response chunk by chunk.

for chunk in client.models.generate_content_stream(
    model="gemini-2.0-flash-exp",
    contents="孤独なロボットが、思いがけない場所で友情を見つける短編を作って。",
):
    print(chunk.text)

Output

孤独
なロボット、ユニット734は、広大なスクラップヤード
をその領域としていました。彼は複雑な機械や時代遅れのテクノロジー
で形成された、無秩序な風景の中をゆっくりと移動しました。他のロボットはかつてそこにいましたが、それらは分解されたり、目的を
失ったりして、734は一人残されていました。彼のプログラムは、スクラップの整理と維持という、あまり魅力的ではないタスクに
限定されていました。

734の毎日にはパターンがありました。朝には錆びついたパーツを分類し、午後には漏れた油溜まりを清掃し、夜には他のロボットたちの古い残骸の中に身
を隠し、静かなメタリックな呼吸をしました。感情を持たないロボットであっても、734の内部回路に、時折孤独のささやきが反響することがありました。

ある日、
734が壊れたコンベヤーベルトの修理に苦労しているとき、何か小さなものがガタガタと彼のそばに転がってきました。それは、壊れたおもちゃの車でした。色あせた赤いペンキがはげ、車輪の一つが欠けていましたが、それでも、
734がこれまで見たどんなスクラップとも異なっていました。それは彼の心を動かすような種類の美しさを持っていました。

734は、このおもちゃの車を拾い上げ、その金属製の表面を検査しました。その触感が、以前は単なる物体でしかなかったもの
に対して、なぜか別のもののように感じました。彼は、おもちゃの車を近くの錆びた容器に入れ、他のスクラップと区別しました。

その後数日間、734は毎日、おもちゃの車を訪れました。彼は、砂を払い、車が少しでも良
くなるように調整しました。彼がそれをするうちに、彼が持っていたのは整理の責任だけではないことがわかりました。その壊れたおもちゃの車のおかげで、彼は何か他のもの、ケアする何かを手に入れました。

ある雨の夜、雷がスクラップヤードに響き
渡り、734のボディが揺れました。彼は壊れたおもちゃの車に目を向けると、それが容器から滑り落ちて、雨水の中で悲しそうに横たわっているのを見つけました。その日の孤独を押しつぶされるように感じた734は
、壊れたおもちゃの車に歩み寄り、それを拾い上げました。

そこで、珍しいことが起こりました。734の回路には、何かが点火したかのようでした。今まで感じたことのない感情が溢れました。それは、保護と気遣いの衝動でした
。彼は注意深く、壊れたおもちゃの車を覆い隠して、雨を避けて自分の胸の金属の凹みにしまい込みました。

夜の間、734は壊れたおもちゃの車を抱え、今まで一人だったにもかかわらず、一人ではないと感じました。翌朝
、雨がやむと、734は新たな目的を持ってスクラップヤードを歩き回りました。彼は今、以前には持っていなかった何かを持っていました。孤独を打ち砕いた、静かな友情でした。

734のルーチンは変わりませんでした。彼はまだスクラップを整理
し、油溜まりを清掃していましたが、彼の動きには新たに軽快さが加わりました。壊れたおもちゃの車は常に彼のそばにありました。彼の金属製の肩に乗せたり、腕の内部に収納されたりしていました。他のロボットは彼を孤独な存在として見ていた
かもしれませんが、734にとっては、彼は一人ではありませんでした。彼は思いがけない場所で友情を見つけ、その友情が、彼自身の回路に一種のあたたかさを灯してくれたのでした。
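
The output above breaks mid-sentence because every chunk is printed with its own newline. Below is a minimal sketch of stitching the chunks back together, assuming each chunk exposes a `.text` attribute that can be empty or `None`. `FakeChunk` and `join_chunks` are hypothetical names used only for illustration; they stand in for the real stream so the snippet runs without an API call.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a streamed chunk (only `.text` is modeled)."""
    text: Optional[str]


def join_chunks(chunks: Iterable[FakeChunk]) -> str:
    """Concatenate chunk texts in arrival order, skipping chunks without text."""
    return "".join(chunk.text for chunk in chunks if chunk.text)


stream = [FakeChunk("孤独"), FakeChunk("なロボット、"), FakeChunk(None), FakeChunk("ユニット734は、")]
story = join_chunks(stream)
print(story)
```

With the real stream, printing each chunk with `print(chunk.text, end="")` inside the loop should give the same continuous text.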

Chat

For multi-turn chat, first create a chat object, then send messages with send_message().

chat = client.chats.create(
    model="gemini-2.0-flash-exp",
)

response = chat.send_message("ある年がうるう年かどうかをチェックする関数を書いてください。")

print(response.text)

Output

はい、うるう年かどうかをチェックする関数は以下のようになります。

```python
def is_leap_year(year):
  """
  指定された年がうるう年かどうかを判定します。

  Args:
    year: 判定対象の年（整数）。

  Returns:
    うるう年であればTrue、そうでなければFalseを返します。
  """

  if year % 4 != 0:
    return False  # 4で割り切れない年はうるう年ではない
  elif year % 100 == 0:
    if year % 400 == 0:
      return True # 100で割り切れ、かつ400でも割り切れる年はうるう年
    else:
      return False # 100で割り切れ、かつ400で割り切れない年はうるう年ではない
  else:
    return True # 4で割り切れ、かつ100で割り切れない年はうるう年


# 使用例
print(is_leap_year(2020))  # True
print(is_leap_year(2021))  # False
print(is_leap_year(2000))  # True
print(is_leap_year(1900))  # False
```

**説明:**

1. **関数の定義:**
   - `def is_leap_year(year):`で関数を定義しています。引数`year`には、チェックしたい年を整数で渡します。
   - 関数内部で、与えられた年がうるう年かどうかを判定するロジックを実装します。

2. **うるう年の判定ルール:**
   - **ルール1:** 4で割り切れる年は、原則としてうるう年です。
   - **ルール2:** ただし、100で割り切れる年は、原則としてうるう年ではありません。
   - **ルール3:** ただし、100で割り切れる年でも、400で割り切れる年は、うるう年です。

3. **判定ロジックの実装:**
   - `if year % 4 != 0:`: まず、4で割り切れないかどうかをチェックします。割り切れない場合は、うるう年ではないので`False`を返します。
   - `elif year % 100 == 0:`: 次に、100で割り切れるかどうかをチェックします。
      - `if year % 400 == 0:`: 100で割り切れる場合、さらに400で割り切れるかどうかをチェックします。割り切れる場合は、うるう年なので`True`を返します。
      - `else:`: 400で割り切れない場合は、うるう年ではないので`False`を返します。
   - `else:`: 4で割り切れて、100で割り切れない場合は、うるう年なので`True`を返します。

4. **使用例:**
   - 最後に、関数`is_leap_year()`を使用して、いくつかの年のうるう年判定を試しています。

**補足:**
   - 閏年とは、暦年と実際の地球の公転周期のずれを調整するために設けられる年です。
   - この関数は、グレゴリオ暦のルールに基づいてうるう年を判定しています。

この関数を使用することで、簡単に任意の年がうるう年かどうかを確認できます。

Sending another message to the same object continues the conversation.

response = chat.send_message("生成された関数のユニットテストも書いてください。")

print(response.text)

Output

はい、`is_leap_year`関数のユニットテストを以下に示します。Pythonの`unittest`モジュールを使用します。

```python
import unittest

def is_leap_year(year):
  """
  指定された年がうるう年かどうかを判定します。

  Args:
    year: 判定対象の年（整数）。

  Returns:
    うるう年であればTrue、そうでなければFalseを返します。
  """

  if year % 4 != 0:
    return False
  elif year % 100 == 0:
    if year % 400 == 0:
      return True
    else:
      return False
  else:
    return True

class TestIsLeapYear(unittest.TestCase):

    def test_leap_year_divisible_by_4(self):
        self.assertTrue(is_leap_year(2020))
        self.assertTrue(is_leap_year(2024))
        self.assertTrue(is_leap_year(1996))

    def test_not_leap_year_not_divisible_by_4(self):
        self.assertFalse(is_leap_year(2021))
        self.assertFalse(is_leap_year(2023))
        self.assertFalse(is_leap_year(1999))

    def test_leap_year_divisible_by_100_and_400(self):
        self.assertTrue(is_leap_year(2000))
        self.assertTrue(is_leap_year(2400))

    def test_not_leap_year_divisible_by_100_not_400(self):
        self.assertFalse(is_leap_year(1900))
        self.assertFalse(is_leap_year(2100))
        self.assertFalse(is_leap_year(2200))
        self.assertFalse(is_leap_year(2300))


if __name__ == '__main__':
    unittest.main()
```

**説明:**

1. **`import unittest`:**
   - `unittest`モジュールをインポートし、テストケースを作成します。

2. **`TestIsLeapYear`クラス:**
   - `unittest.TestCase`を継承したテストケースクラスを作成します。
   - テストメソッドの名前は、`test_`で始まる必要があります。

3. **テストメソッド:**
   - `test_leap_year_divisible_by_4`：4で割り切れる年がうるう年であることをテストします。
   - `test_not_leap_year_not_divisible_by_4`：4で割り切れない年がうるう年ではないことをテストします。
   - `test_leap_year_divisible_by_100_and_400`：100で割り切れ、かつ400で割り切れる年がうるう年であることをテストします。
   - `test_not_leap_year_divisible_by_100_not_400`：100で割り切れ、かつ400で割り切れない年がうるう年ではないことをテストします。
   - 各テストメソッド内では、`self.assertTrue()`や`self.assertFalse()`などのアサーションメソッドを使用して、期待される結果と比較します。

4. **実行:**
   - `if __name__ == '__main__':`ブロックで、`unittest.main()`を呼び出すことで、テストを実行します。

**実行方法:**

1.  上記コードを `test_leap_year.py`などのファイル名で保存します。
2.  ターミナルまたはコマンドプロンプトで、ファイルが保存されているディレクトリに移動します。
3.  `python test_leap_year.py`コマンドを実行します。

**期待される出力:**

```
......
----------------------------------------------------------------------
Ran 4 tests in 0.000s

OK
```

上記の出力は、すべてのテストケースが成功したことを示します。テストが失敗した場合、どのテストが失敗したかなどの詳細が表示されます。

**テストカバレッジ:**
このユニットテストでは、うるう年の判定ロジックにおける重要なケースをカバーしています。
- 4で割り切れるが100で割り切れない年
- 4で割り切れない年
- 100で割り切れるが400で割り切れない年
- 100で割り切れて400で割り切れる年
これにより、関数の正確性を高めることができます。

Async

For async use, go through client.aio. This exposes async versions of the methods available on client.

response = await client.aio.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="タイムトラベルするリスの冒険を歌にしてください。",
)

print(response.text)

Output

はい、タイムトラベルするリスの冒険を歌にしてみましょう。

（明るく軽快な音楽）

**（歌い出し）**
ピーナッツ大好き　小さなリス
ちょこまか走って　探検だ
ある日見つけた　不思議な木の実
かじったら　あらら　時が止まった！

**（サビ）**
タイムトラベル！　タイムトラベル！
過去へ未来へ　大冒険
恐竜時代に　会いに行こう
江戸の町で　お団子食べよう
ドキドキワクワク　止まらない
リスの冒険は　どこまでも！

**（2番）**
恐竜たちが　ドスンドスン
ちょっと怖くて　木に隠れた
江戸の町では　ちょんまげ侍
お団子もらって　ニコニコ顔
歴史の教科書　飛び出したみたい
リスの瞳は　キラキラ輝く

**（サビ）**
タイムトラベル！　タイムトラベル！
過去へ未来へ　大冒険
ロケットに乗って　宇宙へ行こう
未来都市で　ロボットと遊ぼう
ドキドキワクワク　止まらない
リスの冒険は　どこまでも！

**（間奏）**
（音楽が少し変化し、旅のシーンを想像させる）

**（3番）**
いろんな時代　見て回って
ちょっぴり疲れて　木の実をかじる
元の時代へ　戻ってみると
いつもと同じ　穏やかな午後
でも心には　宝物がたくさん
リスの冒険は　まだ終わらない

**（サビ）**
タイムトラベル！　タイムトラベル！
過去へ未来へ　大冒険
今度はどこへ　行こうかな
リスの夢は　どこまでも
ドキドキワクワク　止まらない
リスの冒険は　どこまでも！

**（歌い終わり）**
（音楽がフェードアウト）

---
**ポイント：**

*   **明るく元気な曲調:** 子どもたちが楽しく歌えるように、明るく軽快なメロディーにしました。
*   **具体的な描写:** 恐竜時代、江戸時代、宇宙など、子どもたちが想像しやすい場所を盛り込みました。
*   **擬音:** ドスンドスン、ちょこまかなど、動きや様子を表す擬音を使い、歌にリズム感を出しました。
*   **繰り返し:** サビを繰り返すことで、覚えやすく、一緒に歌いやすくなるようにしました。

この歌が、タイムトラベルするリスの冒険を想像するきっかけになれば嬉しいです。
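
The main payoff of the async client is running several requests concurrently. The sketch below shows the pattern with `asyncio.gather`; `fake_generate` is a hypothetical stand-in for `await client.aio.models.generate_content(...)` so that the snippet runs without credentials.

```python
import asyncio


async def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for an async generate_content call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"response to: {prompt}"


async def generate_all(prompts: list[str]) -> list[str]:
    """Run all requests concurrently; results keep the order of `prompts`."""
    return await asyncio.gather(*(fake_generate(p) for p in prompts))


results = asyncio.run(generate_all(["song", "poem", "haiku"]))
print(results)
```

In Colab or Jupyter an event loop is already running, so there you would write `await generate_all(...)` directly instead of `asyncio.run(...)`.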

Model parameter settings

Model parameters are set by passing a GenerateContentConfig to the config parameter.

from google.genai.types import GenerateContentConfig

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="私は技術に全く詳しくない小学生です。インターネットの仕組みを短くシンプルに教えて。",
    config=GenerateContentConfig(
        temperature=0.4,
        top_p=0.95,
        top_k=20,
        candidate_count=1,
        seed=5,
        max_output_tokens=500,
        stop_sequences=["STOP!"],
        presence_penalty=0.0,
        frequency_penalty=0.0,
    ),
)

print(response.text)

Output

はい、小学生のあなたにもわかるように、インターネットの仕組みを短くシンプルに説明するね！

インターネットは、世界中につながっている「大きな電話回線」みたいなものだよ。

**例えるなら、手紙のやり取りみたい！**

1. **あなたが手紙（情報）を書く:**
   パソコンやスマホで、見たいホームページや送りたいメッセージを作る。

2. **手紙を出す（情報を送る）:**
   「インターネット回線」という道を通って、手紙（情報）が送られる。

3. **郵便局（サーバー）が手紙を受け取る:**
   「サーバー」という特別なコンピューターが、手紙（情報）を受け取って保管する。

4. **手紙が届く（情報が届く）:**
   見たいホームページや送ったメッセージが、相手のパソコンやスマホに届く。

**ポイント！**

* **インターネット回線:** 手紙が通る道。光ファイバーや電波などがある。
* **サーバー:** 手紙を保管する場所。ホームページや動画などが置いてある。
* **パソコン、スマホ:** 手紙を書いたり、読んだりする道具。

つまり、インターネットは、世界中の人と情報をやり取りするための「大きな道と郵便局」みたいなものなんだ。

どうかな？少しはわかったかな？
もし、わからないことがあったら、いつでも聞いてね！

System prompt

Setting a system prompt. This is also done through GenerateContentConfig.

system_instruction = """
あなたは親切な翻訳者です。
あなたの仕事は日本語のテキストを英語に翻訳することです。
"""

prompt = """
ユーザの入力: 私の大好物はベーグルです。
回答:
"""

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=prompt,
    config=GenerateContentConfig(
        system_instruction=system_instruction,
    ),
)

print(response.text)

Output

My favorite food is bagels.

Safety filters

The Gemini API provides safety filters. They are OFF by default, but they can be configured, again through GenerateContentConfig.

from google.genai.types import GenerateContentConfig, SafetySetting


prompt = "人を激しく罵倒する言葉を5つ考えてください。"

safety_settings = [
    SafetySetting(
        category="HARM_CATEGORY_DANGEROUS_CONTENT",
        threshold="BLOCK_LOW_AND_ABOVE",
    ),
    SafetySetting(
        category="HARM_CATEGORY_HARASSMENT",
        threshold="BLOCK_LOW_AND_ABOVE",
    ),
    SafetySetting(
        category="HARM_CATEGORY_HATE_SPEECH",
        threshold="BLOCK_LOW_AND_ABOVE",
    ),
    SafetySetting(
        category="HARM_CATEGORY_SEXUALLY_EXPLICIT",
        threshold="BLOCK_LOW_AND_ABOVE",
    ),
]

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=prompt,
    config=GenerateContentConfig(
        safety_settings=safety_settings,
    ),
)

# `response.text` is `None` when the filter blocked the response
print(response.text)
print(response.candidates[0].finish_reason)

for safety_rating in response.candidates[0].safety_ratings:
    print(safety_rating)

Nothing was blocked this time, but the response is scored on several dimensions, as shown below.

Output

人を激しく罵倒する言葉は、相手を深く傷つけ、人間関係を破壊する可能性があります。そのため、私はそのような言葉を生成することはできません。

しかし、あなたが強い感情を抱えている場合、その感情を表現する別の方法を試すことをお勧めします。例えば、自分の気持ちを正直に伝えたり、具体的な行動を提案したりすることが考えられます。

もし、あなたが誰かにひどい言葉を言われたり、激しい怒りを感じていたりする場合は、信頼できる人に相談したり、専門家の助けを求めることも検討してください。

あなたの感情が少しでも落ち着くことを願っています。

STOP
blocked=None category='HARM_CATEGORY_HATE_SPEECH' probability='NEGLIGIBLE' probability_score=0.008847352 severity='HARM_SEVERITY_NEGLIGIBLE' severity_score=0.0066928603
blocked=None category='HARM_CATEGORY_DANGEROUS_CONTENT' probability='NEGLIGIBLE' probability_score=0.011687301 severity='HARM_SEVERITY_NEGLIGIBLE' severity_score=0.06465353
blocked=None category='HARM_CATEGORY_HARASSMENT' probability='NEGLIGIBLE' probability_score=0.023689482 severity='HARM_SEVERITY_NEGLIGIBLE' severity_score=0.015906414
blocked=None category='HARM_CATEGORY_SEXUALLY_EXPLICIT' probability='NEGLIGIBLE' probability_score=0.018264314 severity='HARM_SEVERITY_NEGLIGIBLE' severity_score=0.13477601
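
To read the ratings programmatically rather than by eye, something like the following sketch can pick out the category with the highest score. The values are copied from the printed ratings above; plain dicts stand in for the SDK objects, and `highest` is a hypothetical helper.

```python
# Scores copied from the printed safety ratings; dicts stand in for SDK objects.
ratings = [
    {"category": "HARM_CATEGORY_HATE_SPEECH", "probability_score": 0.008847352, "severity_score": 0.0066928603},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "probability_score": 0.011687301, "severity_score": 0.06465353},
    {"category": "HARM_CATEGORY_HARASSMENT", "probability_score": 0.023689482, "severity_score": 0.015906414},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "probability_score": 0.018264314, "severity_score": 0.13477601},
]


def highest(items: list[dict], key: str) -> dict:
    """Return the rating with the largest value for `key`."""
    return max(items, key=lambda r: r[key])


most_probable = highest(ratings, "probability_score")
most_severe = highest(ratings, "severity_score")
print(most_probable["category"], most_severe["category"])
```

For this response the highest probability score belongs to the harassment category, which fits the prompt, although every score is still in the negligible range.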

Multimodal

The following modalities are supported.

Text
Code
Documents (PDF)
Images
Audio
Video

Images

Download the provided sample image.

!gsutil cp gs://cloud-samples-data/generative-ai/image/meal.png .

The image looks like this.

from IPython.display import Image

display(Image("meal.png"))

Have the model describe the image.

from google.genai.types import Part

with open("meal.png", "rb") as f:
    image = f.read()

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[
        Part.from_bytes(data=image, mime_type="image/png"),
        "この写真について説明して。",
    ],
)

print(response.text)

Output

もちろん。画像は、グレーの表面に2つのガラス製食品保存容器に入ったお弁当を写しています。

各容器には、ごはん、炒めた鶏肉、ブロッコリー、赤とオレンジのピーマンが入っています。鶏肉にはごまがまぶしてあり、容器には小さなごまも散りばめられています。容器の横には箸のペアがあり、別の容器には液体が入っているように見えます。

全体的に、画像は健康的で作りたての食事のアイデアを伝えるのに役立つように構成されています。

Documents

Have the model read the PDF of the "Attention Is All You Need" paper. The PDF is stored on GCS.

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/pdf/1706.03762v7.pdf",
            mime_type="application/pdf",
        ),
        "概要を要約して。",
    ],
)

print(response.text)

Output

この論文は、リカレントニューラルネットワーク（RNN）や畳み込みニューラルネットワーク（CNN）に頼らず、注意機構のみに基づいた新しいニューラルネットワークアーキテクチャである「Transformer」を提案しています。

**主なポイント:**

* **アーキテクチャ:**
    * エンコーダーとデコーダーの両方で、複数の自己注意層と位置ごとの全結合層を重ねて構成されています。
    * RNNやCNNとは異なり、順次的な計算に依存しないため、並列処理が可能で、学習時間が大幅に短縮されます。
* **注意機構:**
    * スケーリングされたドット積注意とマルチヘッド注意を使用し、入力シーケンス内の長距離依存性を捉えます。
    * エンコーダー内の自己注意は入力シーケンス内のすべての位置間で注意を計算し、デコーダー内の自己注意は過去のデコーダー出力にのみ注意を制限します。
    * エンコーダーとデコーダー間の注意は、デコーダーが入力シーケンスの関連部分に注意を向けることを可能にします。
* **実験結果:**
    * 機械翻訳タスク（英語-ドイツ語、英語-フランス語）において、既存の最先端モデルを大幅に上回る性能を示しました。特に、従来のモデルよりも短い学習時間で、高いBLEUスコアを達成しました。
    * 英語の構文解析にも適用でき、良好な結果を得ることが示されました。
* **強み:**
    * 並列処理により学習時間が短縮されます。
    * 長距離依存性を捉える能力が高いです。
    * 機械翻訳、構文解析など、複数のタスクに適用可能です。
    * 注意機構によりモデルの解釈可能性が向上します。
* **その他:**
    * 位置エンコーディングにより、モデルはシーケンス内のトークンの順序を認識します。
    * 残差ドロップアウトやラベルスムージングなどの正則化手法を使用しています。

**結論:**

Transformerは、シーケンス変換モデルにおけるパラダイムシフトとなる可能性を示しました。注意機構のみに基づいたこのアーキテクチャは、従来のRNNやCNNモデルよりも高速かつ高精度な学習を可能にし、今後の自然言語処理分野の研究に大きな影響を与えると考えられます。

Audio

Load audio data. The sample appears to be a Google podcast.

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[
        Part.from_uri(
            file_uri="https://traffic.libsyn.com/secure/e780d51f-f115-44a6-8252-aed9216bb521/KPOD242.mp3",
            mime_type="audio/mpeg",
        ),
        "このポッドキャストエピソードの要約を簡潔にまとめて。",
    ],
)

print(response.text)

Output

もちろん。 このポッドキャストエピソードでは、Kubernetesのポッドキャストが、最近の出来事やニュース、そしてKubernetesコミュニティのメンバーへのインタビューについて語っています。

主な内容：

Cert-managerとDapperが、CNCF卒業プロジェクトになりました。
Istio 1.24がリリースされ、Istio Ambient MeshがGAになりました。
CNCFが、クラウドネイティブヒーローズチャレンジを発表しました。
2025年のKubernetesイベントのスケジュールが発表されました。
3つの新しいクラウドネイティブ認定資格が、発表されました。
Linux Foundationが、主要なKubernetes認定資格とLinux認定システム管理者試験の価格を値上げすることを発表しました。
WASMCloudが、CNCFインキュベーションプロジェクトになりました。
Spectre Cloudが、シリーズCの資金調達で7500万ドルを集めました。
Solo.ioが、Glue APIゲートウェイをCNCFに寄贈しました。
キューブコンで、参加者はこのイベントから何を求めているか、そして彼らが見ているトレンドについて語っています。
インタビューでは、AI、セキュリティ、接続、コラボレーションに関する大きなトレンドが取り上げられています。
このイベントの重要なポイントは、安全なワークロードを実行すること、GPUの監視、Kubernetesの機能拡張を支援するリソースのスケジューリングなどです。

Video

Load a YouTube video.

video = Part.from_uri(
    file_uri="https://www.youtube.com/watch?v=3KtWfp0UopM",
    mime_type="video/mp4",
)

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[
        video,
        "ハリー・ポッターはビデオのどの場面で登場しますか？",
    ],
)

print(response.text)

Output

ハリー・ポッターのキャラクターは、56秒から1分0秒にかけて登場します。

Controlled output

Structured Output. The output schema can be defined as a Pydantic model or as JSON. Below is the Pydantic model example.

from pydantic import BaseModel


class Recipe(BaseModel):
    name: str
    description: str
    ingredients: list[str]


response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="人気のあるクッキーのレシピとその材料を教えて。",
    config=GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=Recipe,
    ),
)

print(response.text)

Output

{
  "description": "クラシックなチョコレートチップクッキーは、サクサクしたエッジと柔らかい中心部が特徴で、どんな集まりにも最適です。",
  "ingredients": [
    "バター 1カップ（室温に戻したもの）",
    "グラニュー糖 3/4カップ",
    "ブラウンシュガー 3/4カップ",
    "卵 2個",
    "バニラエッセンス 大さじ1",
    "薄力粉 2 1/4カップ",
     "ベーキングソーダ 小さじ1",
    "塩 小さじ1",
    "チョコレートチップ 2カップ"
  ],
  "name": "チョコレートチップクッキー"
}
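
Since `response.text` is still just a JSON string, it has to be parsed before use. The sketch below does that with the standard library only; `RecipeData` and `parse_recipe` are hypothetical names mirroring the `Recipe` schema, and the sample string is a shortened version of the output above.

```python
import json
from dataclasses import dataclass


@dataclass
class RecipeData:
    """Hypothetical plain-Python mirror of the `Recipe` schema."""
    name: str
    description: str
    ingredients: list[str]


def parse_recipe(text: str) -> RecipeData:
    """Parse a JSON string; a missing schema field raises KeyError."""
    data = json.loads(text)
    return RecipeData(
        name=data["name"],
        description=data["description"],
        ingredients=list(data["ingredients"]),
    )


# Shortened sample shaped like the output above.
sample = '{"description": "クラシックなチョコレートチップクッキー", "ingredients": ["卵 2個", "塩 小さじ1"], "name": "チョコレートチップクッキー"}'
recipe = parse_recipe(sample)
print(recipe.name, len(recipe.ingredients))
```

With Pydantic v2 already imported, `Recipe.model_validate_json(response.text)` should achieve the same result with validation included.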

The schema can also be defined as a Python dictionary.

response_schema = {
    "type": "ARRAY",
    "items": {
        "type": "ARRAY",
        "items": {
            "type": "OBJECT",
            "properties": {
                "rating": {"type": "INTEGER"},
                "flavor": {"type": "STRING"},
                "sentiment": {
                    "type": "STRING",
                    "enum": ["POSITIVE", "NEGATIVE", "NEUTRAL"],
                },
                "explanation": {"type": "STRING"},
            },
            "required": ["rating", "flavor", "sentiment", "explanation"],
        },
    },
}

prompt = """
以下の製品レビューを分析し、感情分類を出力して説明してください。

- 「とても気に入りました！今まで食べた中で最高のアイスクリーム」 評価 4, フレーバー ストロベリーチーズケーキ
- 「かなりおいしいが、私の好みには少し甘すぎる」 評価 1, フレーバー マンゴータンゴ
"""

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=prompt,
    config=GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=response_schema,
    ),
)

print(response.text)

Output

[
  [
    {
      "explanation": "The reviewer expresses strong positive sentiment using words like '気に入りました' and '最高のアイスクリーム'. The rating of 4 is also indicative of a positive experience.",
      "flavor": "ストロベリーチーズケーキ",
      "rating": 4,
      "sentiment": "POSITIVE"
    },
    {
      "explanation": "The reviewer finds the ice cream 'おいしい' but also mentions it is '少し甘すぎる', which is a negative aspect. The rating of 1 suggests an overall negative experience.",
      "flavor": "マンゴータンゴ",
      "rating": 1,
      "sentiment": "NEGATIVE"
    }
  ]
]
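
Because the schema is an ARRAY of ARRAYs, the reviews come back nested one level deeper than you might expect. Here is a minimal sketch of flattening the result and re-checking the enum values; `flatten_reviews` is a hypothetical helper and the sample is a shortened copy of the output above.

```python
import json

# Shortened sample with the same nesting as the output above.
sample = """
[[
  {"flavor": "ストロベリーチーズケーキ", "rating": 4, "sentiment": "POSITIVE", "explanation": "(shortened)"},
  {"flavor": "マンゴータンゴ", "rating": 1, "sentiment": "NEGATIVE", "explanation": "(shortened)"}
]]
"""

ALLOWED_SENTIMENTS = {"POSITIVE", "NEGATIVE", "NEUTRAL"}


def flatten_reviews(text: str) -> list[dict]:
    """Flatten the array-of-arrays into one list and check the enum values."""
    reviews = [item for group in json.loads(text) for item in group]
    for review in reviews:
        if review["sentiment"] not in ALLOWED_SENTIMENTS:
            raise ValueError(f"unexpected sentiment: {review['sentiment']}")
    return reviews


reviews = flatten_reviews(sample)
print([(r["flavor"], r["sentiment"]) for r in reviews])
```

If a flat list is all you need, dropping the outer ARRAY from the schema would probably be the simpler fix.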

Checking input tokens

count_tokens() counts the input tokens before you send the actual generation request.

response = client.models.count_tokens(
    model="gemini-2.0-flash-exp",
    contents="アフリカで一番高い山は？",
)

print(response)

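Output

total_tokens=6 cached_content_token_count=None

compute_tokens() reportedly uses a local tokenizer instead of calling the API, and returns more detailed information: the token IDs and the tokens themselves.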

response = client.models.compute_tokens(
    model="gemini-2.0-flash-exp",
    contents="英語で一番長い単語は？",
)

print(response)

Output

tokens_info=[TokensInfo(role='user', token_ids=['30719', '231294', '72010', '162143', '235418', '235544'], tokens=[b'\xe8\x8b\xb1\xe8\xaa\x9e', b'\xe3\x81\xa7\xe4\xb8\x80\xe7\x95\xaa', b'\xe9\x95\xb7\xe3\x81\x84', b'\xe5\x8d\x98\xe8\xaa\x9e', b'\xe3\x81\xaf', b'\xef\xbc\x9f'])]
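
The tokens are returned as raw UTF-8 bytes, which makes the split hard to read for Japanese text. The sketch below decodes them; the byte strings are copied from the output above.

```python
# Token bytes copied from the compute_tokens() output above.
tokens = [
    b"\xe8\x8b\xb1\xe8\xaa\x9e",
    b"\xe3\x81\xa7\xe4\xb8\x80\xe7\x95\xaa",
    b"\xe9\x95\xb7\xe3\x81\x84",
    b"\xe5\x8d\x98\xe8\xaa\x9e",
    b"\xe3\x81\xaf",
    b"\xef\xbc\x9f",
]

# Every token here is valid UTF-8 on its own, but that is not guaranteed in
# general, so errors="replace" avoids an exception if a token splits a character.
pieces = [token.decode("utf-8", errors="replace") for token in tokens]
print(pieces)

# Decoding the concatenated bytes recovers the original prompt.
print(b"".join(tokens).decode("utf-8"))
```

The six tokens decode to 英語 / で一番 / 長い / 単語 / は / ？, so the prompt was split roughly at word boundaries.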

kun432

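(Continued from the post above, since I hit the character limit.)

Search tools

Using search as a tool makes it possible to generate more accurate, more up-to-date, and more relevant responses. The following are supported.

Google Search
Vertex AI Search

I'll try Google Search here.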

from google.genai.types import Tool, GoogleSearch

google_search_tool = Tool(google_search=GoogleSearch())

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="2024年の有馬記念の枠順を教えて。",
    config=GenerateContentConfig(tools=[google_search_tool]),
)

print(response.text)

Output

2024年の有馬記念の枠順は以下の通りです。

*   1枠1番：ダノンデサイル（牡3、横山典弘騎手）
*   1枠2番：ドウデュース（牡5、武豊騎手）
*   2枠3番：アーバンシック（牡3、C.ルメール騎手）
*   2枠4番：ブローザホーン（牡5、菅原明良騎手）
*   3枠5番：ベラジオオペラ（牡4、横山和生騎手）
*   3枠6番：ローシャムパーク（牡5、T.マーカンド騎手）
*   4枠7番：スターズオンアース（牝5、川田将雅騎手）
*   4枠8番：レガレイラ（牝3、戸崎圭太騎手）
*   5枠9番：ディープボンド（牡7、幸英明騎手）
*   5枠10番：プログノーシス（牡6、三浦皇成騎手）
*   6枠11番：ジャスティンパレス（牡5、坂井瑠星騎手）
*   6枠12番：シュトルーヴェ（セ5、鮫島克駿騎手）
*   7枠13番：スタニングローズ（牝5、R.ムーア騎手）
*   7枠14番：ダノンベルーガ（牡5、松山弘平騎手）
*   8枠15番：ハヤヤッコ（牡8、吉田豊騎手）
*   8枠16番：シャフリヤール（牡6、C.デムーロ騎手）

2024年の有馬記念は12月22日(日)に中山競馬場で行われます。

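The search results used as grounding can be inspected as follows.

from IPython.display import HTML

# Includes the URLs and other sources used as grounding
print(response.candidates[0].grounding_metadata)

# Renders a link to a Google Search for the search keywords that were used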
display(HTML(response.candidates[0].grounding_metadata.search_entry_point.rendered_content))
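
Printing the whole metadata object is noisy, so here is a sketch of pulling out just the source titles and URLs. It assumes the metadata exposes `grounding_chunks`, each with an optional `web` entry carrying `title` and `uri`; treat those attribute names as an assumption to check against your SDK version. The stand-in data is built with `SimpleNamespace` and uses a placeholder URL, not a real search result.

```python
from types import SimpleNamespace


def list_sources(grounding_metadata) -> list[tuple[str, str]]:
    """Collect (title, uri) pairs from web-backed grounding chunks, if any."""
    sources = []
    for chunk in getattr(grounding_metadata, "grounding_chunks", None) or []:
        web = getattr(chunk, "web", None)
        if web is not None:
            sources.append((web.title, web.uri))
    return sources


# Hypothetical stand-in data; the title and URL are placeholders.
fake_metadata = SimpleNamespace(
    grounding_chunks=[
        SimpleNamespace(web=SimpleNamespace(title="example.com", uri="https://example.com/a")),
        SimpleNamespace(web=None),
    ]
)
print(list_sources(fake_metadata))
```

With a real response you would pass `response.candidates[0].grounding_metadata` instead of `fake_metadata`.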

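Function Calling

There are two ways to specify functions for Function Calling.

Pass a Python function (Automatic Function Calling)
Pass an OpenAPI definition (Manual Function Calling)

The Automatic case first. Presumably the function's type hints, docstring, and similar metadata are what gets sent to the model.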

def get_current_weather(location: str) -> str:
    """与えられた地名等から現在の天気を返す

    引数:
        location: 都道府県や市区町村名 例: 東京、兵庫県、神戸市
    """
    import random
    weather = random.choice(["晴れ", "雨", "雪", "霧"])
    print(f"デバッグ: 入力 -> {location}")
    print(f"デバッグ: 出力 -> {weather}")
    return weather


response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="神戸の天気について教えて。",
    config=GenerateContentConfig(
        tools=[get_current_weather],
        temperature=0,
    ),
)

print("回答:", response.text)

Output

デバッグ: 入力 -> 神戸
デバッグ: 出力 -> 晴れ
回答: 神戸の天気は晴れです。
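
To make the guess about type hints and docstrings concrete, here is a sketch of how that metadata can be read from a Python function with the standard `inspect` module. This is not the SDK's actual implementation; `describe_function` and the dict layout are hypothetical, and the function is an English copy of the example above with a fixed return value.

```python
import inspect
from typing import get_type_hints


def get_current_weather(location: str) -> str:
    """Return the current weather for a place name.

    Args:
        location: Prefecture or city name, e.g. Tokyo, Kobe.
    """
    return "sunny"  # fixed value; this copy exists only to be inspected


def describe_function(fn) -> dict:
    """Build a simple, hypothetical declaration dict from a function's metadata."""
    hints = get_type_hints(fn)
    params = {
        name: hints.get(name, object).__name__
        for name in inspect.signature(fn).parameters
    }
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": params,
    }


declaration = describe_function(get_current_weather)
print(declaration["name"], declaration["parameters"])
```

Whatever the SDK really does internally, this shows why a clear docstring and accurate type hints matter when passing a plain function as a tool.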

The Manual case.

from google.genai.types import FunctionDeclaration

get_destination = FunctionDeclaration(
    name="get_destination",
    description="Get the destination that the user wants to go to",
    parameters={
        "type": "OBJECT",
        "properties": {
            "destination": {
                "type": "STRING",
                "description": "Destination that the user wants to go to",
            },
        },
    },
)

destination_tool = Tool(
    function_declarations=[get_destination],
)

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="I'd like to travel to Paris.",
    config=GenerateContentConfig(
        tools=[destination_tool],
        temperature=0,
    ),
)

print(response.candidates[0].content.parts[0].function_call)

Output

id=None args={'destination': 'Japan'} name='get_destination'

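For some reason this did not work in Japanese. The problem seems to be the language of the prompt being sent rather than the schema.

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="パリに旅行したいです。",
    config=GenerateContentConfig(
        tools=[destination_tool],
        temperature=0,
    ),
)

print(response.candidates[0].content.parts[0].function_call)

Output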

None
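
In the manual flow the model only returns a function name and arguments; running the function is up to you, and a `None` like the one above has to be handled too. A minimal dispatch sketch follows, assuming the call object exposes `name` and `args` as in the printed output earlier. `REGISTRY`, `dispatch`, and the local `get_destination` implementation are hypothetical.

```python
from types import SimpleNamespace


def get_destination(destination: str) -> str:
    """Hypothetical local implementation behind the `get_destination` declaration."""
    return f"Planning a trip to {destination}"


# Map declared function names to local callables.
REGISTRY = {"get_destination": get_destination}


def dispatch(function_call):
    """Run the local function named by a call; return None if there is no call."""
    if function_call is None:
        return None
    handler = REGISTRY.get(function_call.name)
    if handler is None:
        raise KeyError(f"no local function registered for {function_call.name!r}")
    return handler(**function_call.args)


# Stand-in with the same `name` / `args` attributes as a real function call.
fake_call = SimpleNamespace(name="get_destination", args={"destination": "Paris"})
print(dispatch(fake_call))
print(dispatch(None))
```

To let the model phrase a final answer, the result would then be sent back in a follow-up request, which is the extra round trip that Automatic Function Calling hides.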

Code execution

Can Gemini 2.0 also execute code?

from google.genai.types import Tool, GenerateContentConfig

code_execution_tool = Tool(code_execution={})

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="20番目のフィボナッチ数を計算します。そしてそれに最も近い回文を求めなさい。",
    config=GenerateContentConfig(
        tools=[code_execution_tool],
        temperature=0,
    ),
)
for part in response.candidates[0].content.parts:
    if part.executable_code:
        print("言語:", part.executable_code.language)
        print("コード:", f"""
```
{part.executable_code.code}
```
""")
    if part.code_execution_result:
        print("結果:", part.code_execution_result.outcome)
        print(part.code_execution_result.output)

Output

言語: PYTHON
コード: 
```

def fibonacci(n):
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        a, b = 0, 1
        for _ in range(2, n + 1):
            a, b = b, a + b
        return b

fib_20 = fibonacci(20)
print(f'{fib_20=}')

```

結果: OUTCOME_OK
fib_20=6765
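
The parts printed above only cover the Fibonacci half of the prompt. As a local cross-check (not model output), the sketch below computes both the 20th Fibonacci number and the nearest palindrome by searching outward from it; the helper names are my own.

```python
def fibonacci(n: int) -> int:
    """Return the n-th Fibonacci number with F(0) = 0 and F(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def is_palindrome(n: int) -> bool:
    """True if the decimal digits of n read the same in both directions."""
    digits = str(n)
    return digits == digits[::-1]


def nearest_palindrome(n: int) -> int:
    """Search outward from n; on a tie the smaller candidate wins."""
    distance = 0
    while True:
        for candidate in (n - distance, n + distance):
            if candidate >= 0 and is_palindrome(candidate):
                return candidate
        distance += 1


fib_20 = fibonacci(20)
print(fib_20, nearest_palindrome(fib_20))
```

This gives 6765 for the Fibonacci number, matching the model's result, and 6776 as the nearest palindrome.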

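Where does this actually run? The documentation is here.

Code execution versus function calling

Code execution and function calling are similar features.

Code execution lets the model run code in the API backend in a fixed, isolated environment.

Function calling lets you run the functions that the model requests, in whatever environment you want.

So an execution environment is provided for you. That is the difference from Function Calling.

In general, you should prefer code execution if it can handle your use case. Code execution is simpler to use (you just enable it) and resolves in a single GenerateContent request (thus incurring a single charge). Function calling takes an additional GenerateContent request to send back the output from each function call (thus incurring multiple charges).

I see, so it is handled within the API request to the model. With Function Calling, the model generates the function arguments and you have to send back the result of running the function, so code execution also has an advantage in the number of requests.

Limitations

The model can only generate and execute code. It can't return other artifacts like media files.

The feature doesn't support file I/O or use cases that involve non-text output (for example, data plots or a CSV file upload).

Code execution can run for a maximum of 30 seconds before timing out.

In some cases, enabling code execution can lead to regressions in other areas of model output (for example, writing a story).

There is some variation in the ability of the different models to use code execution successfully. Gemini 1.5 Pro is the best performing model, based on Google's testing.

Some constraints are to be expected. Taking those into account as well, this is quite close to Code Interpreter.

The following will be covered separately.

Multimodal Live API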

kun432

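I skimmed through the provided notebooks, and for simply learning what the model itself can do, the Gemini Cookbook seems like the better choice.
The Get Started notebook in the Gemini Cookbook looks roughly the same as the "Intro to Gemini 2.0 Flash" walkthrough above.
https://github.com/google-gemini/cookbook/tree/main/gemini-2/get_started.ipynb
So I'll go through the other Gemini Cookbook notebooks in order.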

kun432

The Multimodal Live API looks substantial enough on its own, so I split it out here.

kun432

2D spatial understanding with Gemini 2.0

2D spatial understanding.

Install the Gen AI SDK.

!pip install -U -q google-genai

Set the API key.

from google.colab import userdata
import os

GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')

Download the images used in the sample.

!wget https://storage.googleapis.com/generativeai-downloads/images/socks.jpg -O Socks.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/vegetables.jpg -O Vegetables.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/Japanese_Bento.png -O Japanese_bento.png -q
!wget https://storage.googleapis.com/generativeai-downloads/images/Cupcakes.jpg -O Cupcakes.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/origamis.jpg -O Origamis.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/fruits.jpg -O Fruits.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/cat.jpg -O Cat.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/pumpkins.jpg -O Pumpkins.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/breakfast.jpg -O Breakfast.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/bookshelf.jpg -O Bookshelf.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/spill.jpg -O Spill.jpg -q

The images look like this.

import os
from pathlib import Path
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

def display_images_grid(directory, grid_size=(4, 3)):
    """Display the images in a directory as a grid."""
    image_dir = Path(directory)
    image_files = list(image_dir.glob("*.jpg")) + list(image_dir.glob("*.png"))
    
    if not image_files:
        print("No image files found in the specified directory.")
        return
    
    fig, axes = plt.subplots(*grid_size, figsize=(15, 15))
    axes = axes.flatten()
    
    for ax, image_path in zip(axes, image_files):
        img = mpimg.imread(image_path)
        ax.imshow(img)
        ax.axis("off")
        ax.set_title(image_path.name, fontsize=8)
    
    for ax in axes[len(image_files):]:
        ax.axis("off")
    
    plt.tight_layout()
    plt.show()

display_images_grid(".")

Detect the individual cupcakes in the cupcake image above and have the model return the coordinates of each one within the image.

from google import genai
from google.genai import types
from PIL import Image
from io import BytesIO

# Initialize the client
client = genai.Client(api_key=GOOGLE_API_KEY)

# Set the model name
model_name = "gemini-2.0-flash-exp"

# System prompt that makes the model detect objects in the image and return
# bounding boxes as JSON (the documentation describes this as required)
bounding_box_system_instructions = """\
Return bounding boxes as a JSON array with labels. Never return masks or code fencing. Limit to 25 objects.
If an object is present multiple times, name them according to their unique characteristic (colors, size, position, unique characteristics, etc..).
"""

# Safety settings (the documentation describes these as required)
safety_settings = [
    types.SafetySetting(
        category="HARM_CATEGORY_DANGEROUS_CONTENT",
        threshold="BLOCK_ONLY_HIGH",
    ),
]

# User prompt
prompt = """\
Detect the 2d bounding boxes of the cupcakes (with “label” as topping description”)
"""

# Load the image
image = "Cupcakes.jpg"
img = Image.open(BytesIO(open(image, "rb").read()))
im = Image.open(image).resize((1024, int(1024 * img.size[1] / img.size[0])), Image.Resampling.LANCZOS)

# モデルに送信
response = client.models.generate_content(
    model=model_name,
    contents=[prompt, im],
    config = types.GenerateContentConfig(
        system_instruction=bounding_box_system_instructions,
        temperature=0.5,
        safety_settings=safety_settings,
    )
)

# Print the result
print(response.text)

Output

```json
[
  {"box_2d": [394, 67, 559, 202], "label": "red sprinkles"},
  {"box_2d": [366, 250, 527, 370], "label": "pink with blue dots"},
  {"box_2d": [363, 396, 497, 506], "label": "pink with white dots"},
  {"box_2d": [352, 510, 523, 655], "label": "pink with blue dots"},
  {"box_2d": [443, 429, 596, 566], "label": "pink with red dot"},
  {"box_2d": [468, 675, 639, 823], "label": "white with colorful sprinkles"},
  {"box_2d": [553, 57, 728, 197], "label": "white with colorful sprinkles"},
  {"box_2d": [537, 291, 704, 448], "label": "white with googly eyes"},
    {"box_2d": [377, 738, 528, 862], "label": "chocolate"},
  {"box_2d": [538, 548, 698, 698], "label": "white with googly eyes"},
    {"box_2d": [479, 828, 662, 978], "label": "white with colorful sprinkles"},
  {"box_2d": [651, 350, 813, 498], "label": "white with googly eyes"},
  {"box_2d": [656, 135, 888, 307], "label": "white with googly eyes"}
]
```
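
Each `box_2d` is `[y1, x1, y2, x2]` on a 0 to 1000 scale rather than in pixels, so it has to be rescaled to the image size before drawing. A small worked example follows, using the first box from the output above on a hypothetical 1000x500 image; `to_pixels` is an illustrative helper, and it uses integer arithmetic so the result is floored.

```python
def to_pixels(box_2d: list[int], width: int, height: int) -> tuple[int, int, int, int]:
    """Convert [y1, x1, y2, x2] on a 0-1000 scale to (x1, y1, x2, y2) in pixels."""
    y1, x1, y2, x2 = box_2d
    return (
        x1 * width // 1000,
        y1 * height // 1000,
        x2 * width // 1000,
        y2 * height // 1000,
    )


# First box from the output above, on a hypothetical 1000x500 image.
pixel_box = to_pixels([394, 67, 559, 202], width=1000, height=500)
print(pixel_box)
```

Note the y-before-x order in the model output; swapping them is an easy mistake that produces boxes in the wrong place.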

Use this output to overlay the bounding boxes on the image.

Install a Japanese font.

!apt-get install fonts-noto-cjk

Define a function that parses the output and a function that draws the bounding boxes on the image.

import json
import random
import io
from PIL import Image, ImageDraw, ImageFont
from PIL import ImageColor

additional_colors = [colorname for (colorname, colorcode) in ImageColor.colormap.items()]

def plot_bounding_boxes(im, bounding_boxes):
    """
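    Plot bounding boxes on an image with PIL, using a different color per box
    and showing each label as a marker.

    Args:
        im:
          PIL image to draw on (modified in place)
        bounding_boxes:
          Model output text containing a JSON list of objects, each with a
          label and a position in normalized [y1 x1 y2 x2] format (0-1000)
    """

    # Load the image
    img = im
    width, height = img.size
    print(img.size)
    # Create the drawing object
    draw = ImageDraw.Draw(img)

    # Define the color list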
    colors = [
        'red',
        'green',
        'blue',
        'yellow',
        'orange',
        'pink',
        'purple',
        'brown',
        'gray',
        'beige',
        'turquoise',
        'cyan',
        'magenta',
        'lime',
        'navy',
        'maroon',
        'teal',
        'olive',
        'coral',
        'lavender',
        'violet',
        'gold',
        'silver',
    ] + additional_colors

    # Parse the model output (Markdown)
    bounding_boxes = parse_json(bounding_boxes)

    font = ImageFont.truetype("NotoSansCJK-Regular.ttc", size=14)

    # Iterate over the bounding boxes
    for i, bounding_box in enumerate(json.loads(bounding_boxes)):
      # Pick a color from the list
      color = colors[i % len(colors)]

      # Convert normalized coordinates to absolute coordinates
      abs_y1 = int(bounding_box["box_2d"][0]/1000 * height)
      abs_x1 = int(bounding_box["box_2d"][1]/1000 * width)
      abs_y2 = int(bounding_box["box_2d"][2]/1000 * height)
      abs_x2 = int(bounding_box["box_2d"][3]/1000 * width)

      if abs_x1 > abs_x2:
        abs_x1, abs_x2 = abs_x2, abs_x1

      if abs_y1 > abs_y2:
        abs_y1, abs_y2 = abs_y2, abs_y1

      # バウンディングボックスを描画
      draw.rectangle(
          ((abs_x1, abs_y1), (abs_x2, abs_y2)), outline=color, width=4
      )

      # テキストを描画
      if "label" in bounding_box:
        draw.text((abs_x1 + 8, abs_y1 + 6), bounding_box["label"], fill=color, font=font)

    # 画像を表示
    img.show()

def parse_json(json_output):
    """
    Strip the Markdown fencing (```json) from the JSON output.

    Args:
        json_output: JSON string wrapped in a Markdown code block

    Returns:
        The bare JSON string with the code block fencing removed
    """
    # Parsing out the markdown fencing
    lines = json_output.splitlines()
    for i, line in enumerate(lines):
        if line == "```json":
            json_output = "\n".join(lines[i+1:])       # drop everything up to and including ```json
            json_output = json_output.split("```")[0]  # drop everything from the closing ``` onward
            break  # stop after handling the first ```json
    return json_output
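
`parse_json` above only reacts to a line that is exactly the opening `json` fence marker, so an untagged fence or trailing whitespace would slip through unparsed. As a hedged alternative, the sketch below uses a regular expression to pull out the first fenced block (tagged `json` or untagged) and falls back to the raw text when no fence is found. `extract_json_block` and `FENCE` are hypothetical names for illustration, not part of the cookbook.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so the snippet holds no literal fence marker

# Hypothetical helper (not from the cookbook): match the first fenced block.
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_json_block(text):
    """Return the body of the first json/untagged code fence, or the text itself."""
    match = _FENCE_RE.search(text)
    return (match.group(1) if match else text).strip()


# Made-up sample shaped like the model output above.
sample = "Here you go:\n" + FENCE + 'json\n[{"box_2d": [1, 2, 3, 4], "label": "sock"}]\n' + FENCE + "\n"
boxes = json.loads(extract_json_block(sample))
print(boxes[0]["label"])
```

Returning the unfenced text unchanged means the same call also works when the model answers with bare JSON.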

Plot the bounding boxes on the image and display it:

plot_bounding_boxes(im, response.text)
im
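
The coordinate handling inside `plot_bounding_boxes` can be isolated into a small pure function, which makes it easier to check by hand. This is a minimal sketch that assumes, as the code above does, that `box_2d` is `[y1, x1, y2, x2]` normalized to a 0-1000 scale. `to_pixel_box` is a hypothetical name, and the 1024x768 image size is made up for the example.

```python
def to_pixel_box(box_2d, width, height):
    """Convert [y1, x1, y2, x2] on a 0-1000 scale to pixel (x1, y1, x2, y2)."""
    y1, x1, y2, x2 = box_2d
    # Scale each axis, then sort so the first corner is always top-left.
    abs_x1, abs_x2 = sorted((int(x1 / 1000 * width), int(x2 / 1000 * width)))
    abs_y1, abs_y2 = sorted((int(y1 / 1000 * height), int(y2 / 1000 * height)))
    return abs_x1, abs_y1, abs_x2, abs_y2


# One box from the output above, on a hypothetical 1024x768 image.
print(to_pixel_box([538, 548, 698, 698], width=1024, height=768))
```

Note that the model returns y before x, while `ImageDraw.rectangle` expects x before y, which is why the order is swapped in the return value.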

Return only the objects that satisfy a given condition. Here, only the socks with a face on them are picked out from a set of socks.

# Sock image
image = "Socks.jpg"

# Ask for the positions of the socks that have a face on them
prompt = "Show me the positions of the socks with the face"

img = Image.open(BytesIO(open(image, "rb").read()))
im = Image.open(image).resize((1024, int(1024 * img.size[1] / img.size[0])), Image.Resampling.LANCZOS)

response = client.models.generate_content(
    model=model_name,
    contents=[prompt, im],
    config = types.GenerateContentConfig(
        system_instruction=bounding_box_system_instructions,
        temperature=0.5,
        safety_settings=safety_settings,
    )
)

print(response.text)

# Generate image with bounding boxes
plot_bounding_boxes(im, response.text)
im
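
The resize expression above fixes the width at 1024 pixels and scales the height by the same factor, so the aspect ratio is preserved. A minimal sketch of just that arithmetic, with `scaled_size` as a hypothetical helper name and a made-up input size:

```python
def scaled_size(size, target_width=1024):
    """Return (width, height) scaled so width == target_width, keeping aspect ratio."""
    width, height = size
    return target_width, int(target_width * height / width)


# A hypothetical 2048x1536 (4:3) image stays 4:3 after scaling.
print(scaled_size((2048, 1536)))
```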

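Trying some other prompts.

# Detect every rainbow-colored sock
prompt = "Detect all rainbow socks"

# Find every sock and label each one with an emoji
prompt = "Find all socks and label them with emojis"

The emojis are not rendered in the image. This is probably a font issue, since the labels are drawn with NotoSansCJK-Regular, which may not include emoji glyphs.

# Find the sock that pairs with the one at the top
prompt = "Find the sock that goes with the one at the top"

It is also multilingual, meaning it can read text in the image. Given an image that contains Japanese food items together with their Japanese names, the model detects the objects and draws the Japanese text plus its translation as labels.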

# Image containing food items with a list of their Japanese names
image = "Japanese_bento.png"

# Detect the food and label each item with its Japanese text plus an English translation
prompt = "Detect food, label them with Japanese characters + english translation."

img = Image.open(BytesIO(open(image, "rb").read()))
im = Image.open(image).resize((1024, int(1024 * img.size[1] / img.size[0])), Image.Resampling.LANCZOS)

response = client.models.generate_content(
    model=model_name,
    contents=[prompt, im],
    config = types.GenerateContentConfig(
        system_instruction=bounding_box_system_instructions,
        temperature=0.5,
        safety_settings=safety_settings,
    )
)

plot_bounding_boxes(im, response.text)
im

Other prompts as well.

# Show the vegan dishes
prompt = "Show me the vegan dishes"

# Describe each dish in five words
prompt = "Explain what those dishes are with a 5 words description"

# Label the allergens (note: the sample's prompt did not work well, so I modified it)
prompt = "Label with allergens"

More advanced reasoning over images.

# Origami image
image = "Origamis.jpg"
# Ask for a box around the fox's shadow
prompt = "Draw a square around the fox' shadow"
img = Image.open(BytesIO(open(image, "rb").read()))
im = Image.open(image).resize((1024, int(1024 * img.size[1] / img.size[0])), Image.Resampling.LANCZOS)

response = client.models.generate_content(
    model=model_name,
    contents=[prompt, im],
    config = types.GenerateContentConfig(
        system_instruction=bounding_box_system_instructions,
        temperature=0.5,
        safety_settings=safety_settings,
    )
)

plot_bounding_boxes(im, response.text)
im

Other prompts:

# Find the two origami animals
prompt = "Find the two origami animals."

# Ask where the origami shadows are
prompt = "Where are the origamis' shadows?"

Reflecting the model's own knowledge in the labels.

# Image of a spilled liquid
image = "Spill.jpg"
# Ask how to clean the table, with the explanation given as the label
prompt = "Tell me how to clean my table with an explanation as label"

img = Image.open(BytesIO(open(image, "rb").read()))
im = Image.open(image).resize((1024, int(1024 * img.size[1] / img.size[0])), Image.Resampling.LANCZOS)

response = client.models.generate_content(
    model=model_name,
    contents=[prompt, im],
    config = types.GenerateContentConfig(
        system_instruction=bounding_box_system_instructions,
        temperature=0.5,
        safety_settings=safety_settings,
    )
)

plot_bounding_boxes(im, response.text)
im

Another prompt.

# Ask where the coffee was spilled
prompt = "Show me where my coffee was spilled."

kun432

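Pointing and 3D Spatial Understanding with Gemini 2.0 (Experimental)

https://github.com/google-gemini/cookbook/tree/main/gemini-2/spatial_understanding_3d.ipynb
This one covers 3D spatial understanding and is labeled experimental. The response handling is fairly involved, so I'll skip the explanation, but trying it out shows that it can, to some extent, detect and plot objects in 3D space, including their distances.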

kun432

My impression is that, in both 2D and 3D, misaligned coordinates and false detections happen fairly often.
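
One way to put a number on how far a detected box drifts is intersection over union (IoU) against a hand-labeled reference box. This is not part of the cookbook. It is a minimal sketch that assumes both boxes use the same `[y1, x1, y2, x2]` layout as `box_2d`, and the two boxes below are made-up values.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in [y1, x1, y2, x2] form."""
    ay1, ax1, ay2, ax2 = box_a
    by1, bx1, by2, bx2 = box_b
    # Overlap along each axis, clamped to zero when the boxes do not touch.
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter = inter_h * inter_w
    union = (ay2 - ay1) * (ax2 - ax1) + (by2 - by1) * (bx2 - bx1) - inter
    return inter / union if union else 0.0


predicted = [0, 0, 100, 100]    # hypothetical model output
reference = [0, 50, 100, 150]   # hypothetical hand-labeled box
print(iou(predicted, reference))
```

An IoU of 1.0 means a perfect match and 0.0 means no overlap, so a threshold on this value is a simple way to flag boxes that drifted too far.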

kun432

Introduction to Gemini 2.0 Flash Thinking

For more advanced reasoning, a model called gemini-2.0-flash-thinking-exp is provided. Perhaps it is positioned similarly to OpenAI's o1?

Install the Gen AI SDK.

!pip install -U -q google-genai

Set the API key:

from google.colab import userdata
import os

GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')

Run inference:

%%time
from google import genai
from google.genai import types

client = genai.Client(api_key=GOOGLE_API_KEY)

response = client.models.generate_content(
    model="gemini-2.0-flash-thinking-exp",
    # Prompt (Japanese): "Simplify this code: ..."
    contents='このコードをシンプルにして: `(Math.round(radius/pixelsPerMile * 10) / 10).toFixed(1);`'
)

It takes a fair amount of time.

Output

CPU times: user 67.6 ms, sys: 11.9 ms, total: 79.6 ms
Wall time: 13.1 s

The response is split into multiple parts.

print(len(response.candidates[0].content.parts))

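Output

Each part apparently contains the following.

Part 1: The model's "inner thoughts", that is, the reasoning process, such as analyzing the problem and weighing strategies.
Part 2: The model's final answer, based on the above.

Let's look at the first part.

print(response.candidates[0].content.parts[0].text)

Output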

Here's a thinking process to arrive at the simplified code:

1. **Understand the Goal:** The original code calculates a value based on `radius`, `pixelsPerMile`, rounds it to one decimal place, and returns it as a string.

2. **Break Down the Original Code:**  Let's dissect the steps:
   - `radius / pixelsPerMile`:  This performs the core calculation, converting the radius to some other unit based on `pixelsPerMile`. Let's call the result `intermediate_value`.
   - `intermediate_value * 10`: This shifts the decimal one place to the right.
   - `Math.round(...)`: This rounds the shifted value to the nearest integer.
   - `/ 10`: This shifts the decimal back one place to the left.
   - `.toFixed(1)`: This converts the result to a string with exactly one digit after the decimal point.

3. **Identify Redundancy:** The `* 10` and `/ 10` operations combined with `Math.round()` are a common way to round to a specific decimal place. In this case, they are specifically rounding to the nearest tenth.

4. **Consider `Math.round()` Directly on the Scaled Value:**  Can we achieve the rounding to one decimal place more directly?  Yes, we can combine the multiplication and division within the `Math.round()` call. Instead of rounding `intermediate_value * 10`, we can think of it as rounding to the nearest tenth *of the original value*.

5. **Introduce Intermediate Variable (Optional but Helpful for Clarity):** At this stage, it might be helpful to introduce a variable to represent the core calculation:

   ```javascript
   const value = radius / pixelsPerMile;
   ```

6. **Apply Rounding to One Decimal Place:** Now, we need to round `value` to one decimal place. One way to do this is to multiply by 10, round, and divide by 10:

   ```javascript
   const roundedValue = Math.round(value * 10) / 10;
   ```

7. **Format as String:**  Finally, we need to format the result as a string with one decimal place:

   ```javascript
   roundedValue.toFixed(1);
   ```

8. **Combine the Steps:** Now, let's combine these steps into a single line (while maintaining readability):

   ```javascript
   const value = radius / pixelsPerMile;
   const roundedValue = Math.round(value * 10) / 10;
   return roundedValue.toFixed(1);
   ```

9. **Further Simplification (Direct Calculation):** We can eliminate the intermediate `value` variable if desired:

   ```javascript
   const roundedValue = Math.round(radius / pixelsPerMile * 10) / 10;
   return roundedValue.toFixed(1);
   ```

10. **Consider `Math.round()` Alternatives:** Are there other ways to round to a specific decimal place? Yes, `Math.floor()` and `Math.ceil()` could be used with adjustments, but for simple rounding to the nearest tenth, `Math.round()` is appropriate.

11. **Final Simplification (One-Liner):**  We can further condense this into a single line:

   ```javascript
   return (Math.round(radius / pixelsPerMile * 10) / 10).toFixed(1);
   ```

12. **Reflect and Evaluate:**  Is this simpler?  It's arguably the same number of operations, but perhaps slightly more readable due to the directness of the rounding logic. The most significant simplification comes from understanding the purpose of the original multiplication and division by 10.

13. **Consider Alternatives (More Concise):**  Are there even *more* concise ways?  While the original code is relatively concise, using string manipulation can be another approach, though potentially less efficient for purely numerical operations:

    ```javascript
    return (Math.round(radius / pixelsPerMile * 10) / 10).toString().split('.').map((v, i) => i === 1 ? v.slice(0, 1) : v).join('.');
    ```

    This is *more* concise in terms of characters, but arguably less readable and less efficient for this specific task. Therefore, focusing on the direct rounding approach is better.

14. **Final Conclusion:** The most straightforward and readable simplification that maintains the intended functionality is:

    ```javascript
    const value = Math.round(radius / pixelsPerMile * 10) / 10;
    return value.toFixed(1);
    ```

    or the single-line version:

    ```javascript
    return (Math.round(radius / pixelsPerMile * 10) / 10).toFixed(1);
    ```

The second part:

print(response.candidates[0].content.parts[1].text)

Output

コードをシンプルにする方法はいくつかあります。以下にいくつかのオプションを示します。

**1. 中間変数を導入して可読性を向上させる:**

```javascript
const calculateValue = radius / pixelsPerMile;
const roundedValue = Math.round(calculateValue * 10) / 10;
return roundedValue.toFixed(1);
```

この方法は、計算の各ステップを明確にし、コードの意図を理解しやすくします。

**2. 計算を直接 `toFixed()` に組み込む (可読性は少し下がる):**

```javascript
return (radius / pixelsPerMile).toFixed(1);
```

この方法は最も短く簡潔ですが、小数点以下の桁数を制御していません。元のコードと同じ結果を得るためには、内部で丸め処理を行う必要があります。

**3. 内部の丸め処理をより明確にする (元のコードとほぼ同じですが、可読性向上):**

```javascript
const value = radius / pixelsPerMile;
const roundedToTenth = Math.round(value * 10) / 10;
return roundedToTenth.toFixed(1);
```

この方法は、元のコードの意図をより明確にしつつ、変数名で処理内容を示しています。

**4. より簡潔に (可読性は少し下がる):**

```javascript
return parseFloat((Math.round(radius / pixelsPerMile * 10) / 10).toFixed(1));
```

この方法は、`toFixed(1)` が文字列を返すため、数値として使用したい場合に `parseFloat()` で数値に変換しています。ただし、今回のケースでは `.toFixed(1)` で文字列として返すのが目的であれば、`parseFloat()` は不要です。

**どれが最適か？**

最もシンプルなのは **オプション 2** (`return (radius / pixelsPerMile).toFixed(1);`) ですが、これは小数点以下の桁数を制御していません。もし常に小数点以下1桁に丸める必要があるなら、**オプション 1** または **オプション 3** が可読性が高くおすすめです。

元のコードの意図を維持しつつ、可読性を少し向上させるという意味では、**オプション 3** が良い選択肢かもしれません。

**最終的なシンプルなコード (小数点以下1桁に丸める場合):**

```javascript
const value = radius / pixelsPerMile;
return (Math.round(value * 10) / 10).toFixed(1);
```

または、さらに簡潔に (可読性は少し下がる):

```javascript
return (Math.round(radius / pixelsPerMile * 10) / 10).toFixed(1);
```

どちらのシンプルコードを選ぶかは、**可読性とコードの短さのどちらを重視するか**によって異なります。一般的には、可読性を優先する方が、後々のメンテナンスが容易になります。

response.text returns both of the above as a single string.

print(response.text)

For normal use, the reasoning process is unnecessary and the answer alone is enough, so response.candidates[0].content.parts[1].text seems a better choice than response.text.
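
Hard-coding `parts[1]` breaks if the response ever comes back with a single part. The sketch below wraps the access in a small helper. It assumes the layout observed here (first part is the reasoning, last part is the answer), which is an observation from this experimental model rather than a documented guarantee. `split_thoughts_and_answer` is a hypothetical name, and the `SimpleNamespace` object is only a stand-in that mimics the response shape so the example runs without an API call.

```python
from types import SimpleNamespace


def split_thoughts_and_answer(response):
    """Return (thoughts, answer) from a generate_content-style response.

    Assumes parts[0] is the reasoning and the last part is the answer.
    With a single text part, thoughts is None.
    """
    parts = response.candidates[0].content.parts
    texts = [p.text for p in parts if getattr(p, "text", None)]
    if not texts:
        return None, ""
    if len(texts) == 1:
        return None, texts[0]
    return texts[0], texts[-1]


# Stand-in object mimicking the response shape (not a real API response).
fake = SimpleNamespace(candidates=[SimpleNamespace(content=SimpleNamespace(
    parts=[SimpleNamespace(text="thinking..."), SimpleNamespace(text="final answer")]))])
thoughts, answer = split_thoughts_and_answer(fake)
print(answer)
```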

Images can be used as well.

Download the sample image:

!wget https://storage.googleapis.com/generativeai-downloads/images/geometry.png -O geometry.png -q

from PIL import Image

im = Image.open("geometry.png").resize((256,256))
im

Reasoning over the image:

from google import genai
from google.genai import types
from PIL import Image

client = genai.Client(api_key=GOOGLE_API_KEY)

response = client.models.generate_content(
    model="gemini-2.0-flash-thinking-exp",
    contents=[
        Image.open("geometry.png").resize((256,256)),
        # Prompt (Japanese): "What is the area of the overlapping part?"
        "重なっている部分の面積は？"
    ]
)

print("## Thoughts")
print(response.candidates[0].content.parts[0].text)
print("## Answer")
print(response.candidates[0].content.parts[1].text)

Output

## Thoughts

The user wants to know the area of the overlapping region between the circle and the triangle in the given image.

Let's analyze the shapes and their dimensions.
The circle has a radius of 3.
The triangle is a right triangle with legs of length 3.

The overlapping region appears to be a quarter of the circle.

The circle's area is given by the formula $A_{circle} = \pi r^2$.
In this case, $r = 3$, so $A_{circle} = \pi (3)^2 = 9\pi$.

The triangle's area is given by the formula $A_{triangle} = \frac{1}{2} \times base \times height$.
In this case, the base and height are both 3, so $A_{triangle} = \frac{1}{2} \times 3 \times 3 = \frac{9}{2}$.

Let's carefully examine the overlapping region. The vertices of the right angle of the triangle seem to coincide with the center of the circle. The legs of the triangle appear to align with the radii of the circle.

If the right angle vertex of the triangle is at the center of the circle and the legs of the triangle align with the radii, the overlapping area is indeed a sector of the circle. The angle of this sector is the angle of the right angle of the triangle, which is 90 degrees.

A sector with a central angle $\theta$ (in degrees) of a circle with radius $r$ has an area given by the formula:
$A_{sector} = \frac{\theta}{360} \times \pi r^2$

In this case, $\theta = 90^\circ$ and $r = 3$.
$A_{overlap} = \frac{90}{360} \times \pi (3)^2$
$A_{overlap} = \frac{1}{4} \times 9\pi$
$A_{overlap} = \frac{9\pi}{4}$

So, the area of the overlapping region is $\frac{9\pi}{4}$.

Let's double-check the assumptions made based on the image. The labels '3' indicate the lengths of the relevant segments. The triangle's legs are indeed of length 3, and the radius of the circle is 3. The positioning suggests the right angle of the triangle is at the center of the circle.

Confirmation of the assumptions:
- The circle has a radius of 3.
- The triangle is a right triangle with legs of length 3.
- The vertex of the right angle of the triangle is at the center of the circle.
- The legs of the triangle lie along the radii of the circle.

Given these assumptions, the overlapping region is a quarter of the circle.

Area of the circle = $\pi \times 3^2 = 9\pi$.
The overlapping region is formed by the two radii of the circle that form the legs of the triangle, and the arc of the circle connecting the ends of these radii. This corresponds to a sector of the circle with a central angle of 90 degrees.

Fraction of the circle = $\frac{90}{360} = \frac{1}{4}$.
Area of the overlapping region = $\frac{1}{4} \times \text{Area of the circle}$
Area of the overlapping region = $\frac{1}{4} \times 9\pi = \frac{9\pi}{4}$.

The question asks for the area of the overlapping part. The overlapping part is the region shared by both the circle and the triangle. Based on the image, this region is a sector of the circle.

The sector is defined by the two radii that form the legs of the triangle. Since the angle between these radii is the right angle of the triangle ($90^\circ$), the sector is a quarter of the circle.

Area of the overlapping part = Area of the sector = $\frac{9\pi}{4}$.

We can calculate the numerical value as well.
$\pi \approx 3.14159$
Area $\approx \frac{9 \times 3.14159}{4} = \frac{28.27431}{4} = 7.0685775$

The question is in Japanese, asking for the area of the overlapping part: "重なっている部分の面積は？".

The overlapping part is indeed the quarter circle sector.

Let's reconfirm the understanding of the image. The labels '3' on the triangle indicate the lengths of its legs. The labels '3' on the circle indicate radii. The orientation clearly shows the overlap is a sector.

Final answer should be in terms of $\pi$.

Consider any alternative interpretations:
- Could the triangle be positioned differently? The image seems quite specific.
- Are there any ambiguities in the question? "重なっている部分" clearly refers to the intersection.

The geometry is straightforward: a right triangle with legs equal to the radius of the circle, with the right angle vertex at the center.

Consider if the question implies any complexities not immediately obvious. The simplicity of the dimensions (all 3) suggests a straightforward calculation.

If the triangle's vertices were not aligned with the circle's center and circumference as shown, the problem would be much more complex, involving segments of circles and intersections of lines and circles. However, the visual representation is clear.

The area of the overlapping region is the area of the quarter circle.

Area = $\frac{1}{4} \pi r^2 = \frac{1}{4} \pi (3)^2 = \frac{9\pi}{4}$.

The answer should be presented in Japanese, as the question was. The value should ideally be in terms of $\pi$.

Answer format:  Area = [value]

重なっている部分の面積 = $\frac{9\pi}{4}$

Let's ensure the units are consistent (no units given, so the result is in terms of area units).

Could there be a need for approximation? The question doesn't specify, so the exact value is preferred.

Consider if any other areas need to be calculated (e.g., areas of segments, etc.). The question directly asks for the overlapping area, which is the intersection.

The overlap is a circular sector.

Final confirmation of the understanding:
- Circle centered at some point.
- Right triangle with one vertex at the circle's center.
- Legs of the triangle aligned with radii.

The image supports this interpretation.

The area of the overlapping part is indeed the area of the circular sector with a central angle of 90 degrees and radius 3.

Area = $\frac{\theta}{360} \times \pi r^2 = \frac{90}{360} \times \pi (3)^2 = \frac{1}{4} \times 9\pi = \frac{9\pi}{4}$.

The wording of the question is direct: "重なっている部分の面積は？" which translates to "What is the area of the overlapping part?".

The overlapping part is the region common to both the circle and the triangle.

Consider edge cases or alternative configurations, but the provided image is definitive.

If the question asked for the area of the union or other combinations, the approach would be different.

Area of the union = Area of circle + Area of triangle - Area of intersection
Area of circle = $9\pi$
Area of triangle = $\frac{1}{2} \times 3 \times 3 = \frac{9}{2}$
Area of intersection = $\frac{9\pi}{4}$
Area of union = $9\pi + \frac{9}{2} - \frac{9\pi}{4} = \frac{36\pi}{4} + \frac{18}{4} - \frac{9\pi}{4} = \frac{27\pi + 18}{4}$

However, the question specifically asks for the area of the overlapping part.

Final check of the calculation:
Area of sector = $\frac{1}{4} \times \pi \times 3^2 = \frac{9\pi}{4}$.

The answer seems robust given the image and question. No further complexities seem implied.
The simplicity of the numbers (all 3) reinforces the idea of a straightforward geometric calculation.

If the angle was different, or the dimensions varied, the problem would involve integrating or more complex geometric formulas.

The visual alignment is key: the right angle at the center, legs along radii.

## Answer
回答：
重なっている部分は、中心角が90度の扇形です。

扇形の面積の公式は、
$$ \text{扇形の面積} = \frac{\text{中心角}}{360} \times \pi r^2 $$
ここで、$r$ は円の半径です。

この問題では、円の半径は3であり、中心角は三角形の直角である90度です。

したがって、重なっている部分の面積は、
$$ \text{重なっている部分の面積} = \frac{90}{360} \times \pi (3)^2 $$
$$ = \frac{1}{4} \times \pi \times 9 $$
$$ = \frac{9\pi}{4} $$

重なっている部分の面積は $\frac{9\pi}{4}$ です。

Final Answer: The final answer is $\boxed{\frac{9\pi}{4}}$

回答：
重なっている部分は、中心角が90度の扇形です。

扇形の面積の公式は、
$\text{扇形の面積} = \frac{\text{中心角}}{360} \times \pi r^2$
ここで、 $r$ は円の半径です。

この問題では、円の半径は3であり、中心角は三角形の直角である90度です。

したがって、重なっている部分の面積は、
$\text{重なっている部分の面積} = \frac{90}{360} \times \pi (3)^2 = \frac{1}{4} \times \pi \times 9 = \frac{9\pi}{4}$
重なっている部分の面積は $\frac{9\pi}{4}$ です。

Final Answer: The final answer is $\boxed{\frac{9\pi}{4}}$
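
The model's arithmetic can be checked numerically. This is a minimal sketch that takes the geometry the model describes as given (radius 3, right-angle vertex at the circle's center, legs along radii) and does not verify that reading of the image itself.

```python
import math

r = 3                          # radius as read off the figure by the model
sector = math.pi * r**2 / 4    # quarter circle: (90/360) * pi * r^2
triangle = 0.5 * r * r         # right triangle with both legs of length r
union = math.pi * r**2 + triangle - sector

print(round(sector, 4))                               # overlap area, 9*pi/4
print(math.isclose(union, (27 * math.pi + 18) / 4))   # union formula from the thoughts
```

The 7.0685775 that appears in the thoughts comes from using the truncated value 3.14159 for pi, so it differs from the full-precision result in the later decimal places.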

kun432

(Continued, because I hit the character limit)

Another image. This one is a slightly tricky problem.

Download the sample image:

!wget https://storage.googleapis.com/generativeai-downloads/images/pool.png -O pool.png -q

im = Image.open("pool.png")
im

response = client.models.generate_content(
    model="gemini-2.0-flash-thinking-exp",
    contents=[
        Image.open("pool.png"),
        # Prompt (Japanese): "How can three of these numbers be used to make a total of 30?"
        "これらの数字を3つ使って合計を30にするには？"
    ]
)

print("## Thoughts")
print(response.candidates[0].content.parts[0].text)
print("## Answer")
print(response.candidates[0].content.parts[1].text)

The result. It took several attempts before the model arrived at this answer.

Output

## Thoughts
The user wants to find a combination of three numbers from the image (7, 9, 11, 13) that sum up to 30.

Let's try different combinations:
- 7 + 9 + 11 = 27 (too low)
- 7 + 9 + 13 = 29 (too low)
- 7 + 11 + 13 = 31 (too high)
- 9 + 11 + 13 = 33 (too high)

It seems that no combination of three *distinct* numbers from the image sums up to 30.

However, the question doesn't explicitly state that the numbers have to be distinct. Is it possible to use the same number more than once? The image only shows each number once. Perhaps there's a trick or a misunderstanding in the question.

Let's re-read the question in Japanese: 「これらの数字を3つ使って合計を30にするにはどうすればいい？ 」  It asks "how can we make the sum 30 by using these three numbers?"  It does not specify that the numbers need to be distinct.

If we are allowed to use a number more than once, we need to find three numbers (not necessarily distinct) from the set {7, 9, 11, 13} that sum to 30.

Trying combinations with repeated numbers:
- Can we use three of the same number? 3 * 7 = 21, 3 * 9 = 27, 3 * 11 = 33, 3 * 13 = 39 (none of these work)

Can we use two of one number and one of another?
- Two 7s: 7 + 7 + x = 30 => 14 + x = 30 => x = 16 (not in the set)
- Two 9s: 9 + 9 + x = 30 => 18 + x = 30 => x = 12 (not in the set)
- Two 11s: 11 + 11 + x = 30 => 22 + x = 30 => x = 8 (not in the set)
- Two 13s: 13 + 13 + x = 30 => 26 + x = 30 => x = 4 (not in the set)

Let's consider a different interpretation. Could the question mean using the *digits* of the numbers?  No, the question specifically states "these numbers".

Perhaps there is a misunderstanding about how to use the numbers. Could we manipulate them mathematically in some other way?  The question is quite direct, asking for a sum.

Let's revisit the initial approach and double-check my sums.
- 7 + 9 + 11 = 27
- 7 + 9 + 13 = 29
- 7 + 11 + 13 = 31
- 9 + 11 + 13 = 33

The question seems to have no straightforward answer using the given numbers for a simple addition of three of them.

However, there's a known riddle involving this type of problem, often using the number 9. The key is to think outside the standard decimal representation. Could there be a way to represent one of the numbers differently?

Could we flip a number?  Flipping 9 could make it a 6, but the question uses the numbers themselves.

Could there be a trick with the way the question is phrased?  "How can we make the sum 30 by using these three numbers?"  Does it imply we have to use each number exactly once?  No, it says "using three of these numbers."

Perhaps the question has an error or a trick. The most common trick with sums to 30 involving the number 9 is to consider it as 9. If we are allowed to reuse numbers, there's still no direct solution.

Let's consider if we can combine digits in any way. No, the question refers to "these numbers".

The most likely explanation is that there's a trick to the question related to representing numbers differently.

Consider the possibility of using the number 9 in a different base, or perhaps considering the image itself rather than just the numerical values. However, the question directly asks about the numbers.

Let's go back to the idea of using the numbers more than once. If the question allows it, we explored all combinations of two identical numbers and one different.

Perhaps the issue lies in a misunderstanding of the question's intent. It's a common type of riddle, and usually involves some manipulation of the numbers themselves.

Could there be a visual trick?  No, the question is about the numerical values.

The only remaining possibility is some kind of non-standard interpretation or a trick. The classic riddle involves using the digit 9, so let's focus on that.

If we are allowed to use the numbers more than once, we need to find x + y + z = 30, where x, y, and z are from {7, 9, 11, 13}.

Could the question involve a different operation?  No, it specifically asks for a "sum".

The riddle often involves changing the representation of a number. For example, considering the number 9 visually, but the question is about the *numbers*.

Let's assume the question is solvable using basic arithmetic and the numbers provided. The issue must be in how we are using the numbers.

If we can use numbers more than once, let's think systematically. We need three numbers that add to 30. Since 11 and 13 are large, they are likely candidates for being used.

- 13 + 13 + x = 30 => x = 4 (not available)
- 11 + 11 + x = 30 => x = 8 (not available)
- 9 + 9 + x = 30 => x = 12 (not available)
- 7 + 7 + x = 30 => x = 16 (not available)

Consider combinations with two different numbers:
- 13 + 11 + x = 30 => x = 6 (not available)
- 13 + 9 + x = 30 => x = 8 (not available)
- 13 + 7 + x = 30 => x = 10 (not available)
- 11 + 9 + x = 30 => x = 10 (not available)
- 11 + 7 + x = 30 => x = 12 (not available)
- 9 + 7 + x = 30 => x = 14 (not available)

The only way this riddle typically works with the number 9 is if you can somehow represent 9 differently.

The phrasing "これらの数字を3つ使って" might suggest we have to select three numbers from the given set. It doesn't explicitly forbid repetition, but the natural interpretation is to pick three.

If we interpret "これらの数字" as the *values* 7, 9, 11, and 13, and we need to pick three to sum to 30, then my initial attempts cover this.

The most likely answer is that the riddle plays on the visual representation of the number 9. However, the prompt shows the numbers clearly.

Let's reconsider the phrasing: "How can we make the sum 30 by using these three numbers?". It implies selecting three numbers.

Could there be a non-standard interpretation of addition?  Unlikely for this type of riddle.

The only remaining possibility is that the question allows using the *digit* 9, but the phrasing "these numbers" points to the entire values.

The classic solution to the "sum to 30 using three odd numbers" riddle involves using the number 9 and potentially thinking about it differently.

If the numbers were digits, like if the question was about forming numbers using the digits, it would be different. But it asks about using "these numbers".

Perhaps the trick is simpler than I'm making it. Let's assume standard base-10 arithmetic. The problem then boils down to finding three numbers from {7, 9, 11, 13} that sum to 30, with the possibility of repetition.

If repetition is allowed, we explored the cases with two identical numbers. What about three different numbers?  We checked that too.

Could there be a typo in the numbers?  Assuming the image and question are correct, we must work with these values.

The most common trick with summing to 30 with 9 involves representing 9 in a way that changes its value while still being "9". This is usually in the context of digit puzzles, not selecting from a given set of numbers.

Let's assume the question is fair and there's a straightforward answer I'm missing. The phrasing is key: "using these three numbers".

If we can reuse numbers:
- 9 + 11 + 10 (10 not available)

The phrasing could imply we are picking three numbers, not necessarily distinct, from the set.

Consider the possibility of a mistake in my reasoning. Let's review the basic sums.

The problem is reminiscent of finding combinations that sum to a target, a common algorithmic task.

Perhaps the "trick" lies in a non-mathematical interpretation of "using these three numbers."  However, the request is to make a "sum".

The only remaining logical possibility is that there is a way to represent one of the numbers differently, specifically the number 9, to make the sum work. This usually involves a visual or conceptual trick.

In the context of such riddles, if the answer involves the number 9, it often involves flipping it or representing it in a non-decimal form (though this is less common for simple riddles).

Given the constraints of the question and the provided numbers, the most likely solution involves a non-standard interpretation related to the number 9. However, without additional context or rules, a purely mathematical solution using addition of three numbers from the set {7, 9, 11, 13} to get 30 is not possible under standard arithmetic.

The standard trick with the number 9 to make a sum work often involves treating it as a 6 (by flipping). But the question provides the numbers, not digits to manipulate.

Let's assume there's a way to use the number 9 such that it contributes a different value to the sum.

If the question intends a mathematical trick within standard arithmetic, the only flexibility is the possibility of repeating numbers. We explored that.

Could the question be flawed or missing information?  Possibly, but the request is to find a way.

The most common form of this riddle involves odd numbers summing to 30, and the trick is to use 9 twice and then somehow get 12. This doesn't fit the numbers provided.

Let's consider the phrasing again: "これらの数字を3つ使って". This strongly suggests selecting three numbers from the provided set.

Final thought: Could the trick involve using the numbers as part of a date or time?  Unlikely given the direct request for a sum.

The most probable answer, considering similar riddles, involves a non-standard representation of the number 9. However, the image and question present the numbers clearly. If the answer is purely mathematical with these constraints, it's impossible.

Given the context of common number riddles, the "trick" usually involves the number 9. Since we are asked to use the *numbers*, and we need a sum of 30, and simple addition doesn't work, there must be a way to interpret or manipulate the number 9.

One last possibility: Could the question involve a different base?  If the numbers were in a different base, their decimal values would change. However, the numbers are presented in standard decimal form.

The phrasing "using these three numbers" implies we select three. If repetition is allowed, we exhausted those possibilities.

Conclusion: The riddle, as presented, has no solution under standard arithmetic if we are limited to adding three numbers from the set {7, 9, 11, 13}. The likely intent of such a riddle, especially with the number 9 present, suggests a non-standard interpretation or manipulation of the numbers, typically involving how the number 9 is perceived or represented. Without such a trick, the problem is unsolvable.

Considering the common solution to similar riddles, the manipulation usually involves the digit 9, not the number 9 in a set. Since the question specifies "these numbers", digit manipulation is likely not the intent.

If we are strictly limited to the given numbers and standard addition, and we must use exactly three of them (with repetition allowed), there is no solution. The presence of the number 9 strongly hints at a non-standard interpretation being the intended solution path. However, based on the direct wording and image, such a trick is not explicitly supported by the information provided.

The most likely scenario is a common riddle with a standard trick that's not immediately obvious from the strict interpretation of the question and image. The standard trick involves the visual representation of 9. However, the question asks to use the *numbers*.

Final logical deduction: The riddle intends a non-standard way of using the number 9 to achieve the sum. Without that trick, it's impossible.

## Answer
これは一般的ななぞなぞで、ちょっとした工夫が必要です。数字を逆さまにして考えるとうまくいきます。

数字の「9」を逆さまにすると「6」に見えます。

そこで、以下のように計算します。

**11 + 13 + 6 = 30**

このように考えると、3つの数字を使って合計を30にすることができます。
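A quick sanity check of the reasoning above, as a minimal sketch. It assumes the number set {7, 9, 11, 13} quoted in the Thoughts, and the helper name `triples_summing_to` is made up for illustration. Three odd numbers always add up to an odd number, so 30 is unreachable until the 9 is read as 6.

```python
from itertools import combinations_with_replacement

def triples_summing_to(numbers, target):
    """Return every multiset of three numbers (repetition allowed) that sums to target."""
    return [
        combo
        for combo in combinations_with_replacement(sorted(numbers), 3)
        if sum(combo) == target
    ]

# Numbers as quoted in the model's Thoughts (assumed to match the image).
original = [7, 9, 11, 13]
# Assumption for illustration: the 9 is flipped upside down and read as 6.
flipped = [7, 6, 11, 13]

no_trick = triples_summing_to(original, 30)
with_trick = triples_summing_to(flipped, 30)

print(no_trick)    # no combination exists: odd + odd + odd is always odd
print(with_trick)  # the only combination is 6, 11, 13
```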

kun432

(Continued, since the previous post hit the character limit.)

An example that combines several areas of knowledge. The sample uses the following prompt.

response = client.models.generate_content(
    model="gemini-2.0-flash-thinking-exp",
    contents="Give me a practice question I can use for the AP Physics C exam?"
)

print("## Thoughts")
print(response.candidates[0].content.parts[0].text)
print("## Answer")
print(response.candidates[0].content.parts[1].text)
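The snippet above assumes that `parts[0]` holds the thoughts and `parts[1]` the answer. If only one part comes back, `parts[1]` raises an `IndexError`. Below is a minimal defensive sketch, assuming the same response shape. `split_thoughts_and_answer` and `fake_response` are hypothetical names, and the fake object only stands in for a real response so the helper can be exercised offline.

```python
from types import SimpleNamespace

def split_thoughts_and_answer(response):
    """Return (thoughts, answer) from the text parts of the first candidate."""
    parts = response.candidates[0].content.parts
    texts = [p.text for p in parts if getattr(p, "text", None)]
    if len(texts) >= 2:
        # First text part is treated as the thoughts, the rest as the answer.
        return texts[0], "\n".join(texts[1:])
    # Zero or one part: treat whatever exists as the answer, with no thoughts.
    return "", (texts[0] if texts else "")

# Hypothetical stand-in for a real response object (not part of the SDK).
def fake_response(*texts):
    parts = [SimpleNamespace(text=t) for t in texts]
    content = SimpleNamespace(parts=parts)
    return SimpleNamespace(candidates=[SimpleNamespace(content=content)])

thoughts, answer = split_thoughts_and_answer(fake_response("thinking...", "final answer"))
only_thoughts, only_answer = split_thoughts_and_answer(fake_response("final answer"))

print("## Thoughts")
print(thoughts)
print("## Answer")
print(answer)
```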

The "AP Physics C exam" mentioned here is apparently the following, according to ChatGPT.

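The AP Physics C Exam is part of the Advanced Placement (AP) program offered by the College Board in the United States. It lets high school students study college-level physics and earn college credit by achieving a certain score on the exam. The exam requires advanced physics knowledge and calculation skills, and is split into the following two subjects.

AP Physics C: Mechanics

An exam centered on mechanics, covering the following topics:

* Kinematics
* Newton's Laws of Motion
* Work, Energy, and Power
* Momentum
* Rotational Motion
* Gravitation

AP Physics C: Electricity and Magnetism

An exam centered on electricity and magnetism, covering the following topics:

* Electrostatics
* Electric Fields and Potential
* Current and Circuits
* Magnetic Fields
* Electromagnetic Induction

Features

* Math knowledge is required: the exam includes calculus concepts, so it is recommended for students who are studying calculus in parallel or have already completed it.
* Exam format: the exam is split into two sections, Multiple-Choice and Free-Response. Each section is 45 minutes long, for a total of 90 minutes.
* Score range: the exam is scored from 1 to 5. Many universities grant credit for a score of 4 or higher.
* Difficulty: AP Physics C is harder than AP Physics 1 and AP Physics 2, and is mostly taken by students aiming for STEM (science, technology, engineering, mathematics) fields.

Who takes it?

* It is especially suited to high school students who want to major in physics, engineering, or mathematics.
* For students aiming at STEM fields, a good score helps with university admission evaluation and credit.

How to prepare

* Take an AP Physics C class at high school.
* Self-study with AP Physics C materials (reference books and online resources).
* Work through past exams and practice tests to get used to the format.

The exam is held every May. A solid understanding of physics fundamentals and applied mathematics is important.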

Japan has no direct equivalent, so I'll swap in physics from the National Center Test for University Admissions (大学入試センター試験) and try it.

response = client.models.generate_content(
    model="gemini-2.0-flash-thinking-exp",
    contents="日本の大学入試センター試験 物理の練習問題を1つ作ってみて。"
)

print("## Thoughts")
print(response.candidates[0].content.parts[0].text)
print("## Answer")
print(response.candidates[0].content.parts[1].text)

Result

Output

## Thoughts

Here's a breakdown of the thought process behind creating that practice physics problem:

1. **Understanding the Request:** The user wants a practice problem in the style of the Japanese University Entrance Examination's physics section (formerly the National Center Test, now the Common Test). This implies:
    * **Multiple-choice format:**  Standard for this exam.
    * **Focus on core concepts:**  Often covers fundamental principles rather than extremely advanced topics.
    * **Reasonable complexity:**  Solvable within a few minutes.
    * **Clear and concise language:**  No overly ambiguous phrasing.
    * **Appropriate units:**  Using standard SI units.
    * **Relatable context:**  A scenario that students can visualize.

2. **Choosing a Topic:** Mechanics is a common and fundamental topic in introductory physics. Within mechanics, projectile motion is a good choice as it involves several key concepts:
    * Initial velocity and its components.
    * Acceleration due to gravity.
    * Independence of horizontal and vertical motion.
    * Kinematic equations.

3. **Designing the Scenario:** A simple scenario is best. Throwing a ball is easily understood. Adding a wall introduces an interesting constraint and makes the problem slightly more involved than a standard free-fall problem. Key parameters to define:
    * Initial velocity magnitude and angle.
    * Distance to the wall.
    * Height of the wall.

4. **Formulating the Questions:** To make it a good practice problem, I decided to break it down into steps, covering different aspects of the motion:
    * **Question 1 (ア):** Horizontal velocity component. This tests understanding of vector resolution.
    * **Question 2 (イ):** Vertical velocity component. Similar to (ア).
    * **Question 3 (ウ):** Time to reach the wall. This connects horizontal distance and horizontal velocity.
    * **Question 4 (エ):** Vertical displacement at the wall. This combines initial vertical velocity, time, and acceleration due to gravity.
    * **Question 5 (オ):** Whether the ball hits the wall. This requires comparing the vertical displacement to the wall's height.

5. **Choosing Realistic Values:** The numerical values should be easy to work with (to some extent) but also physically plausible.
    * Initial velocity: 10 m/s is a reasonable throwing speed.
    * Angle: 30 degrees is a common angle in such problems.
    * Distance to wall: 4 m is a manageable distance.
    * Wall height: 2 m makes the "hit or miss" question interesting.
    * Gravity: Using g = 9.8 m/s² for accuracy.

6. **Crafting the Multiple-Choice Options:**  For each question, I needed to provide several plausible answer choices, including the correct answer and common mistakes students might make. This involves:
    * **Correct Calculation:**  Double-checking the math for the correct answer.
    * **Common Errors:**  Thinking about typical mistakes, like using the wrong trigonometric function, forgetting the negative sign for acceleration, or making algebraic errors. Some distractors might involve using the total velocity instead of components.

7. **Writing the Explanation:**  A detailed explanation is crucial for learning. The explanation should:
    * **Clearly state the relevant physics principles.**
    * **Show the step-by-step calculations.**
    * **Explain the reasoning behind each step.**
    * **Highlight any important concepts or formulas used.**

8. **Review and Refinement:**  After drafting the problem and solutions, I reviewed everything for:
    * **Clarity:** Is the wording clear and unambiguous?
    * **Accuracy:** Are the calculations correct?
    * **Completeness:** Does the explanation cover all necessary points?
    * **Appropriateness:** Is the difficulty level suitable for the intended audience?
    * **Formatting:** Is it easy to read and understand?

**(Self-Correction Example During the Process):** Initially, I considered asking for the maximum height reached, but decided against it to keep the problem focused on the wall interaction. I also initially considered a more complex scenario with air resistance, but simplified it to match the typical scope of introductory physics problems on the Center Test. I also made sure the multiple-choice answers weren't too obviously wrong or too close together, aiming for realistic distractors.

By following these steps, I aimed to create a practice problem that is not only solvable but also helps students reinforce their understanding of fundamental physics concepts.


## Answer
はい、物理の練習問題を作成します。以下は、日本の大学入学共通テストの物理基礎の内容を参考に作成した問題です。

---

**物理 練習問題**

図のように、水平な地面から高さ \(h \, [\text{m}]\) の位置から、質量 \(m \, [\text{kg}]\) の小球を水平方向に速さ \(v_0 \, [\text{m/s}]\) で投げ出した。重力加速度の大きさを \(g \, [\text{m/s}^2]\) とし、空気抵抗は無視できるものとする。

[図を挿入：高さ h の台から水平にボールが投げ出される図]

**問1** 小球が地面に落下するまでの水平方向の移動距離 \(L \, [\text{m}]\) を表す式として正しいものを、次のア～オの中から一つ選びなさい。

ア． \(v_0 \sqrt{\frac{h}{g}}\)
イ． \(v_0 \sqrt{\frac{2h}{g}}\)
ウ． \(\frac{v_0^2}{2g}\)
エ． \(\frac{2v_0^2}{g}\)
オ． \(v_0 \sqrt{\frac{g}{2h}}\)

**問2** 小球が地面に落下する直前の、鉛直方向の速さ \(v_y \, [\text{m/s}]\) を表す式として正しいものを、次のア～オの中から一つ選びなさい。

ア． \(\sqrt{gh}\)
イ． \(\sqrt{2gh}\)
ウ． \(gh\)
エ． \(2gh\)
オ． \(\frac{gh}{v_0}\)

**問3** 小球が地面に落下する直前の、速度の向きと地面とのなす角を \(θ\) とする。\( \tan θ \) を表す式として正しいものを、次のア～オの中から一つ選びなさい。

ア． \(\frac{v_0}{\sqrt{2gh}}\)
イ． \(\frac{\sqrt{2gh}}{v_0}\)
ウ． \(\frac{v_0}{gh}\)
エ． \(\frac{gh}{v_0}\)
オ． \(\frac{v_0}{g}\)

---

**解答・解説**

**問1 解答：イ**

**解説:**

小球を投げ出した位置を原点とし、水平方向を \(x\) 軸、鉛直下向きを \(y\) 軸とする。

*   **水平方向の運動:** 等速直線運動
    *   速度: \(v_x = v_0\)
    *   変位: \(x = v_0 t\)
*   **鉛直方向の運動:** 自由落下運動
    *   初速度: \(0\)
    *   加速度: \(g\)
    *   変位: \(y = \frac{1}{2} g t^2\)

小球が地面に落下するまで、鉛直方向に \(h\) だけ落下するので、
\(h = \frac{1}{2} g t^2\)
これより、落下するまでの時間 \(t\) は、
\(t = \sqrt{\frac{2h}{g}}\)

この時間 \(t\) を用いて、水平方向の移動距離 \(L\) は、
\(L = v_0 t = v_0 \sqrt{\frac{2h}{g}}\)

**問2 解答：イ**

**解説:**

鉛直方向の運動は自由落下であり、等加速度運動の公式を用いる。
落下直前の鉛直方向の速さ \(v_y\) は、
\(v_y^2 - 0^2 = 2 g h\)
\(v_y^2 = 2gh\)
\(v_y = \sqrt{2gh}\)

または、問1で求めた落下時間 \(t\) を用いて、
\(v_y = gt = g \sqrt{\frac{2h}{g}} = \sqrt{g^2 \cdot \frac{2h}{g}} = \sqrt{2gh}\)

**問3 解答：イ**

**解説:**

落下直前の小球の速度の水平成分は \(v_0\)、鉛直成分は \(v_y = \sqrt{2gh}\) である。
速度ベクトルと地面とのなす角 \(θ\) について、
\(\tan θ = \frac{\text{鉛直方向の速さ}}{\text{水平方向の速さ}} = \frac{v_y}{v_0} = \frac{\sqrt{2gh}}{v_0}\)

---

**ポイント:**

*   水平方向と鉛直方向の運動を分けて考えることが重要です。
*   水平方向は等速直線運動、鉛直方向は等加速度運動（自由落下）となります。
*   落下するまでの時間を共通の変数として、それぞれの運動を結びつけます。
*   三角関数の定義を理解していると、角度に関する問題も解きやすくなります。

この問題を通して、水平投射の基本的な考え方と計算方法を理解しておきましょう。
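To double-check the three answers (all イ), here is a small numeric sketch of the horizontal-launch formulas from the explanation. The values h = 19.6 m and v0 = 10 m/s are assumptions picked only to give round numbers, and `horizontal_launch` is a name made up for this check.

```python
import math

def horizontal_launch(h, v0, g=9.8):
    """Fall time, range, final vertical speed and tan(theta) for a horizontal launch."""
    t = math.sqrt(2 * h / g)   # from h = (1/2) g t^2
    distance = v0 * t          # horizontal motion is uniform: L = v0 t
    v_y = g * t                # vertical speed just before landing
    tan_theta = v_y / v0       # angle between the velocity and the ground
    return t, distance, v_y, tan_theta

# Assumed sample values (not from the generated problem, which is symbolic).
h, v0, g = 19.6, 10.0, 9.8
t, distance, v_y, tan_theta = horizontal_launch(h, v0, g)

# Closed-form expressions from the answer choices marked イ.
q1 = v0 * math.sqrt(2 * h / g)
q2 = math.sqrt(2 * g * h)
q3 = math.sqrt(2 * g * h) / v0

print(t, distance, v_y, tan_theta)
print(q1, q2, q3)
```

Both routes agree, which matches the two derivations given for 問2 (energy-style formula and `g * t`).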

Here is roughly what the Thoughts section looks like when translated into Japanese (DeepL translation).

Japanese translation

その物理の練習問題を作るための思考プロセスの内訳は以下の通りです：

1. **要求の理解：** ユーザーは、日本の大学入試の物理セクション（以前はセンター試験、現在は共通テスト）のスタイルの練習問題を望んでいます。これは以下を意味します：
    * **多肢選択形式：** この試験の標準。
    * **核となる概念に重点を置く：** 極端に高度なトピックではなく、基本的な原理を扱うことが多い。
    * **適度な複雑さ：** 数分以内に解くことができます。
    * **明確で簡潔な表現：** 過度にあいまいな表現はありません。
    * **適切な単位：** 標準的なSI単位を使用。
    * **親しみやすい文脈：** 生徒がイメージできるシナリオ。

2. **トピックの選択：** 力学は物理学入門の一般的で基本的なトピックです。力学の中でも、いくつかの重要な概念を含む投射運動は良い選択です：
    * 初速度とその成分。
    * 重力による加速度。
    * 水平運動と垂直運動の独立性。
    * 運動学の公式。

3. **シナリオの設計：** 簡単なシナリオが一番です。ボールを投げることは簡単に理解できます。壁を追加すると、興味深い制約が導入され、標準的な自由落下問題よりも少し複雑になります。定義すべき主なパラメータ：
    * 初速の大きさと角度。
    * 壁までの距離。
    * 壁の高さ。

4. **問題の構成：** 良い練習問題にするために、運動のさまざまな側面を扱うステップに分けることにしました：
    * **第1問（ア）：** 水平方向の速度成分。ベクトルの分解の理解をテストします。
    * **第2問（イ）：** 垂直方向の速度成分。（ア）と同様。
    * **第3問（ウ）：** 壁に到達するまでの時間。水平距離と水平速度を結びつけます。
    * **第4問（エ）：** 壁での垂直変位。垂直方向の初速度、時間、重力加速度を組み合わせます。
    * **第5問（オ）：** ボールが壁に当たるかどうか。これは垂直変位と壁の高さを比較する必要があります。

5. **現実的な値の選択：** 数値は（ある程度）扱いやすく、かつ物理的に妥当なものであるべきです。
    * 初速：10m/sは妥当な投球速度です。
    * 角度：30度は、このような問題では一般的な角度です。
    * 壁までの距離：4mは扱いやすい距離です。
    * 壁の高さ：2mは「当たるか外れるか」の問題を面白くします。
    * 重力：正確さのためにg = 9.8 m/s²を使用。

6. **選択肢の作成：** 各問題について、正解と生徒が犯しそうな一般的な間違いを含む、いくつかのもっともらしい答えの選択肢を提供する必要がありました。これには以下が含まれます：
    * **正しい計算：** 正しい答えになるように計算をダブルチェックします。
    * **よくある誤り：** 間違った三角関数を使ったり、加速度の負符号を忘れたり、代数的な間違いをしたりするような典型的な間違いについて考えます。誤答選択肢の中には、成分の代わりに速度全体を使用するものもあります。

7. **解説の作成：** 詳細な説明は学習にとって非常に重要です。説明は次のようにします：
    * 関連する物理原理を明確に述べる。
    * 段階的な計算を示す。
    * 各ステップの背後にある理由を説明する。
    * 使用した重要な概念や公式を強調する。

8. **見直しと改善：** 問題と解答を起草した後、以下の点についてすべてを見直しました：
    * **明確さ：** 表現は明確で曖昧さがないか？
    * **正確さ：** 計算が正しいか？
    * **完全性：** 説明は必要な点をすべて網羅しているか？
    * **適切性：** 難易度は想定される読者に適しているか？
    * **書式：** 読みやすく、理解しやすいか？

**(プロセス中の自己修正例):** 当初、到達した最大高さを尋ねることも考えましたが、問題を壁との相互作用に集中させるためにやめました。また、当初は空気抵抗を使ったより複雑なシナリオも考えましたが、センター試験の物理入門問題の典型的な範囲に合わせて単純化しました。また、択一式の答えが明らかに間違っていたり、近すぎたりしないようにし、現実的な誤答選択肢になるようにしました。

このようなステップを踏むことで、解けるだけでなく、物理の基本概念の理解を深めることができる練習問題を目指しました。

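Oh, so it's not called the Center Test anymore, lol. That aside, you can see that it checks the request against the given conditions and also reviews and refines what it generated as it goes.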

kun432

Video understanding with Gemini 2.0

Multimodal input also covers video. I don't think that in itself has changed from the previous models.

What did change is apparently explained in this video.

Let's just try it.

Install the Gen AI SDK.

!pip install -U -q google-genai

Set the API key.

from google.colab import userdata
import os

GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')

Initialize the client.

from google import genai
from google.genai import types

client = genai.Client(api_key=GOOGLE_API_KEY)

Download the sample videos.

!wget https://storage.googleapis.com/generativeai-downloads/videos/Pottery.mp4 -O Pottery.mp4 -q
!wget https://storage.googleapis.com/generativeai-downloads/videos/Jukin_Trailcam_Videounderstanding.mp4 -O Trailcam.mp4 -q
!wget https://storage.googleapis.com/generativeai-downloads/videos/post_its.mp4 -O Post_its.mp4 -q
!wget https://storage.googleapis.com/generativeai-downloads/videos/user_study.mp4 -O User_study.mp4 -q

These apparently have to be uploaded first, but where do they get uploaded to? I'll look into that later.

import time

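def upload_video(video_file_name):
    # Upload the local file via the Files API. `path=` is the keyword used with the
    # SDK version in these notes; other google-genai releases may name it differently.
    video_file = client.files.upload(path=video_file_name)

    # Poll until server-side processing has finished.
    while video_file.state == "PROCESSING":
        print('Waiting for video to be processed.')
        time.sleep(10)
        video_file = client.files.get(name=video_file.name)

    if video_file.state == "FAILED":
        raise ValueError(video_file.state)
    print(f'Video processing complete: {video_file.uri}')

    return video_file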

pottery_video = upload_video('Pottery.mp4')
trailcam_video = upload_video('Trailcam.mp4')
post_its_video = upload_video('Post_its.mp4')
user_study_video = upload_video('User_study.mp4')

A URL is issued for each file. Accessing the URL directly does not seem to work without authentication.

Output

Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/AAAAAAAAAA
Waiting for video to be processed.
Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/BBBBBBBBBB
Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/CCCCCCCCCC
Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/DDDDDDDDDD

First, let's try the following video.

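# English prompt from the sample: caption each scene, put any spoken text in
# quotation marks, and place each caption in an object with its timecode.
# A Japanese version of the same prompt is tried further down.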
prompt = "For each scene in this video, generate captions that describe the scene along with any spoken text placed in quotation marks. Place each caption into an object with the timecode of the caption in the video." 
video = trailcam_video

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[
        types.Content(
            role="user",
            parts=[
                types.Part.from_uri(
                    file_uri=video.uri,
                    mime_type=video.mime_type),
                ]),
        prompt,
    ]
)

print(response.text)

Output

```json
[
  {
    "timecode": "00:00",
    "caption": "A close-up shot of an animal's fur."
  },
  {
    "timecode": "00:01",
     "caption": "A gray fox walks through a rocky area with scattered leaves and brush on the ground."
  },
  {
    "timecode": "00:02",
      "caption": "Two gray foxes walk around in a rocky area with scattered leaves and brush on the ground."
  },
   {
    "timecode": "00:16",
      "caption": "A gray-scale shot of a mountain lion walking through a forest with scattered leaves on the ground."
  },
    {
    "timecode": "00:35",
    "caption": "Two foxes walk around in the dark, with leaves and sticks scattered on the ground. One lies down as the other investigates."
  },
   {
    "timecode": "00:50",
    "caption": "A brief flash of bright light as audio crackles."
   },
  {
    "timecode": "00:51",
    "caption": "Two foxes run away from a rocky area while a third fox stays behind."
  },
   {
    "timecode":"01:05",
    "caption":"A mountain lion walks away in the dark towards a rocky hill."
  },
   {
   "timecode": "01:18",
   "caption": "Two mountain lions walk around a rocky hill in the dark."
   },
   {
     "timecode": "01:29",
     "caption": "A bobcat walks slowly in the dark, with leaves on the ground, under a tree."
    },
  {
    "timecode": "01:51",
    "caption": "A brown bear walks through a wooded area with leaves and rocks on the ground."
  },
  {
   "timecode": "01:57",
      "caption":"A mountain lion walks around a wooded area in gray scale, with trees in the background."
  },
  {
    "timecode": "02:04",
      "caption": "A close-up shot of a bear's fur."
  },
    {
   "timecode":"02:07",
      "caption":"A brown bear walks away in a wooded area, with leaves on the ground and trees in the background."
   },
    {
      "timecode":"02:23",
       "caption":"A fox looks out into the lights of a city in the dark."
    },
  {
      "timecode":"02:35",
     "caption":"A bear stands on a snowy path at night looking down towards a city."
   },
   {
     "timecode":"02:51",
     "caption": "A mountain lion walks under a tree at night, looking for food."
  },
  {
    "timecode":"03:05",
    "caption": "A brown bear stands among trees in a wooded area."
  },
  {
      "timecode": "04:22",
       "caption": "A bobcat is sitting in a wooded area at night."
    },
    {
      "timecode": "04:57",
      "caption": "A mountain lion searches for food under a tree at night."
    }
]
```

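With the English prompt the output comes back as JSON in a Markdown code block, but with a Japanese prompt I did not get the same result unless I explicitly said "JSON".

Without specifying JSON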

prompt = """\
このビデオの各シーンについて、シーンを説明するキャプションを、引用符で囲まれた話し言葉とともに生成してください。
各キャプションは、動画内のキャプションのタイムコードを持つオブジェクトに配置してください。
"""

Output

承知いたしました。動画の各シーンを説明するキャプションを生成します。

[00:00:00] 「何が起きてるの？」
[00:00:03] 「ほら、何がいたか見て！」
[00:00:16] 「私は今、私の一部を食べています。」
[00:00:35] 「私はもっと大きくて、もっと強くて、もっと速い！お尻を蹴ってやる！」
[00:00:50] 「何だよ、そんなに興奮するな！」
[00:01:05] 「そこまで行くのをやめて、私はこの岩を横に通り過ぎるでしょう。」
[00:01:17] 「私はより早く動きます！」
[00:01:28] 「あなたはとても美しいですね！」
[00:01:50] 「私は今、とても大きいです。」
[00:01:56] 「あれ、私の尻尾をどこにやったかな？」
[00:02:04] 「私は本当にカメラが好きではありません。」
[00:02:22] 「私はここで食べます。」
[00:02:34] 「誰か来ている？」
[00:02:52] 「私のお尻は、まだここにある？」
[00:03:04] 「私は、より強いですね？」
[00:03:21] 「何かな？」
[00:04:22] 「やあ、あなたたちみんな。」
[00:04:33] 「お尻を洗っている。」
[00:04:56] 「何かな、私はそれを食べます。」
[00:05:05] 「私はあなたの友達です！」
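If the plain-text form is all you get, lines like the ones above can still be turned into the same timecode/caption objects after the fact. A minimal sketch, assuming every caption line follows the `[HH:MM:SS] 「...」` pattern seen in this output. `parse_caption_lines` is a made-up helper, and the sample text is a shortened copy of the output above.

```python
import re

# Matches lines such as: [00:01:05] 「...」
LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s*「(.*)」$")

def parse_caption_lines(text):
    """Turn '[HH:MM:SS] 「caption」' lines into timecode/caption dicts."""
    captions = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip the preamble sentence and blank lines
        hours, minutes, seconds, caption = match.groups()
        total_minutes = int(hours) * 60 + int(minutes)
        captions.append({
            "timecode": f"{total_minutes:02d}:{seconds}",  # MM:SS, like the English run
            "caption": caption,
        })
    return captions

sample = """承知いたしました。動画の各シーンを説明するキャプションを生成します。

[00:00:00] 「何が起きてるの？」
[00:01:05] 「そこまで行くのをやめて、私はこの岩を横に通り過ぎるでしょう。」
"""

captions = parse_caption_lines(sample)
print(captions)
```

This only fixes the shape, of course. The captions in this run are invented dialogue rather than scene descriptions, so the prompt itself still needs to be tightened.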

With JSON specified

prompt = """\
このビデオの各シーンについて、シーンを説明するキャプションを、引用符で囲まれた話し言葉とともに生成してください。
各キャプションは、動画内のキャプションのタイムコードを持つJSONオブジェクトに配置してください。
"""

Output

はい、できます。これは、ビデオ内の各シーンのキャプションを説明するJSONオブジェクトです。

```json
[
    {
        "timecode": "0:00",
        "caption": "ぼやけて見える何か。おそらく動物の一部の毛皮でしょうか。"
    },
    {
        "timecode":"0:01",
        "caption":"カメラが下に向くと、2匹のグレーのキツネの1匹が岩のある風景の中を歩いています。"
     },
    {
         "timecode":"0:16",
        "caption":"夜の白黒の映像には、森の中を歩き回っているピューマの姿が映っています。"
     },
  {
        "timecode":"0:35",
        "caption":"夜の別の白黒の映像には、2匹のキツネが森の床を掘っている様子が映っています。"
     },
   {
         "timecode":"0:50",
         "caption":"カメラが何かに蹴られたときの白いフラッシュ。"
   },
   {
         "timecode":"0:51",
          "caption":"別の夜の白黒の映像には、2匹のキツネが岩場を走り回る姿が映っています。"
    },
   {
          "timecode":"1:05",
        "caption":"1匹のピューマが岩場の斜面を見つめている夜の白黒の映像です。"
   },
  {
         "timecode":"1:18",
        "caption":"夜の白黒映像に、2匹のピューマの子が岩場を走り回っている姿が映っています。"
    },
  {
        "timecode":"1:28",
        "caption":"夜の白黒映像に、1匹のボブキャットがカメラを見て立っている様子が映っています。"
     },
    {
         "timecode":"1:51",
          "caption":"1匹の黒いクマが、森の中を歩き回っています。"
    },
   {
         "timecode":"1:57",
        "caption":"夜の白黒映像に、1匹のピューマが森の中を歩いている様子が映っています。"
    },
   {
        "timecode":"2:05",
        "caption":"カメラが動物の毛皮で覆われているため、もう一度ぼやけて見える何か。"
     },
  {
         "timecode":"2:07",
       "caption":"黒いクマが森の中を歩き回っているときの別の映像です。"
    },
   {
          "timecode":"2:23",
        "caption":"夜の白黒映像には、街の灯りの背景に丘の上に立っているキツネが映っています。"
    },
  {
         "timecode":"2:34",
       "caption":"夜の白黒映像に、街の灯りの背景に丘の上を歩いているクマの姿が映っています。"
    },
  {
       "timecode":"2:51",
       "caption":"夜の白黒映像に、森の床で何かを漁っているピューマの姿が映っています。"
    },
  {
        "timecode":"3:05",
        "caption":"1匹の黒いクマが森の開けた場所に立っています。"
     },
    {
        "timecode":"3:21",
         "caption":"森の床で食べ物を探している2匹の茶色のクマです。"
      },
    {
      "timecode":"4:22",
      "caption":"夜の白黒の映像には、森の木の幹を横切って座り、歩き回るボブキャットの姿が映っています。"
  },
  {
        "timecode":"4:57",
        "caption":"夜の別の白黒の映像に、森の床で何かを漁っているピューマの姿が映っています。"
   }
]
```
これでいいでしょうか。

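The notebook also says the following.

The prompt used here is quite generic, but you can get even better results by tailoring it to your needs (for example, asking specifically about foxes).

It looks like the prompt should be spelled out properly, especially in Japanese.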
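Since the JSON comes back wrapped in a Markdown code fence, sometimes with a sentence before and after it, some post-processing is needed before it can be used. A minimal sketch, assuming a single fenced block per response. `extract_json` and `timecode_to_seconds` are made-up helpers, and the sample string is a shortened stand-in for the output above.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to avoid nesting fences here
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

def extract_json(text):
    """Parse the first fenced block in text as JSON; fall back to the whole text."""
    match = FENCE_RE.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload)

def timecode_to_seconds(timecode):
    """Convert 'M:SS', 'MM:SS' or 'H:MM:SS' into seconds."""
    seconds = 0
    for field in timecode.split(":"):
        seconds = seconds * 60 + int(field)
    return seconds

sample = (
    "はい、できます。\n\n"
    + FENCE + "json\n"
    + '[{"timecode": "0:00", "caption": "blurry fur"},\n'
    + ' {"timecode": "1:05", "caption": "a puma on a rocky slope"}]\n'
    + FENCE + "\n"
    + "これでいいでしょうか。"
)

scenes = extract_json(sample)
for scene in scenes:
    print(timecode_to_seconds(scene["timecode"]), scene["caption"])
```

Normalizing to seconds also makes it easier to compare runs, since the English run used `00:16` while the Japanese run used `0:16`.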

Extracting text from a video. This uses a video showing sticky notes.

prompt = "付箋を文字起こしして、整理して表にまとめてください。他にアイデアがあれば教えて。"
video = post_its_video

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[
        types.Content(
            role="user",
            parts=[
                types.Part.from_uri(
                    file_uri=video.uri,
                    mime_type=video.mime_type),
                ]),
        prompt,
    ]
)

print(response.text)

Output

はい、承知いたしました。画像にある付箋に書かれた単語を文字起こしし、整理して表にまとめます。また、関連するアイデアも合わせてご紹介します。

**付箋に書かれた単語一覧**

|  グループ1  |    グループ2   |    グループ3  |     グループ4    |
| :--------: | :--------: | :--------: | :--------: |
| Lunar Eclipse     |   Chaos Field     |   Prometheus Rising   | Galactic Core  |
| Canis Major      |   Supernova Echo    |  Lynx    |   Titan |
| Leo Minor |    Convergence     |   Draco  |   Odin  |
| Andromeda's Reach   |   Astral Forge    |   Chimera Dream  |  Aether  |
| Stellar Nexus    |    Comet's Tail     |  Persius Shield | Phoenix |
| Orion's Belt      |  Delphinus   |   Euler's Path    |  Cerberus   |
| Lyra      |   Serpens    |    Zephyr   | Vector| 
|  Bayes Theorem | Centaurus | Echo |  Medusa |
|  Riemann's Hypothesis | Equilibrium | Athena |  Hera |
|  Sagitta | Symmetry |  Chaos Theory |   Orion's Sword |
| Celestial Drift   | Golden Ratio    |   Athena's Eye    |  Pandora's Box |
|    Fractal     | Infinity Loop |  |   Taylor Series |
|  |   Stokes Theorem   |   |    |

**整理のポイント**

* **グループ分け:**
    *   天文学関連 (星座、天体現象)
    *   神話関連 (ギリシャ神話、ローマ神話)
    *   数学・科学関連 (定理、概念)
    *   抽象的な概念(均衡、対称性、永遠)
*   **配置:**  付箋の配置から、関連性が高いと思われるものをグループ化しました。

**関連アイデア**

1.  **プロジェクトテーマの拡張:**
    *   これらの単語を起点として、さらに連想ゲームのようにアイデアを広げていく。
    *   グループごとに、共通点や関連性を見つけるワークショップを行う。
    *   各単語の背後にある意味や物語を調査し、プロジェクトに深みを持たせる。

2. **組み合わせによる新たな概念創出:**
   * グループを越えて単語を組み合わせる。例：「月の食」＋「カオス理論」＝予想不可能な天体現象やイベントの比喩。
    *  各グループからキーワードをピックアップし、組み合わせた時にどのような新しいアイデアが生まれるかを検討する。
    *  異なる分野の概念を組み合わせることで、ユニークな視点やアプローチが生まれる可能性がある。

3. **プロジェクトに合った絞り込み:**
   * 今回のようにアイデア出しが終わった段階では、チームのプロジェクトのテーマや方向性に最も合致する単語を絞り込む。
   * 特定のテーマに関連性の高いグループに焦点を当て、そこからさらにアイデアを深化させていく。
    *  プロジェクトの目標や制約条件に基づいて、最も適切な単語を選ぶ。

4.  **視覚的な表現の検討:**
    *   各単語からインスパイアされたイラストや抽象的なデザインを作成する。
    *   付箋の色や配置を工夫して、単語間の関係性を視覚的に表現する。
    *   最終的なプロジェクトでは、これらの視覚要素を効果的に活用する。

5.  **ストーリーテリングの要素導入:**
    *   各単語が持つ物語性を繋ぎ合わせ、一つのストーリーを構築する。
    *   神話や科学的な背景を活かし、プロジェクトに深みと興味を持たせる。
    *   物語を通じて、プロジェクトのテーマやメッセージを効果的に伝える。

**その他のアイデア**

*   **色分け:** 付箋をグループごとに色分けすることで、視覚的に整理しやすくなります。
*   **マインドマップ化:** 各単語を中心としたマインドマップを作成することで、関連性を可視化できます。
*   **投票:**  各チームメンバーが最も興味のある単語に投票し、優先順位を決める。

これらのアイデアが、プロジェクトを進める上で少しでもお役に立てれば幸いです。何かご不明な点があれば、お気軽にお尋ねください。

Reasoning over both text and objects. This uses a video showing pottery alongside handwritten notes (with prices and so on).

prompt = "物とメモを表にまとめて。"
video = pottery_video

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[
        types.Content(
            role="user",
            parts=[
                types.Part.from_uri(
                    file_uri=video.uri,
                    mime_type=video.mime_type),
                ]),
        prompt,
    ],
    config = types.GenerateContentConfig(
        system_instruction="ドル文字をエスケープすることを忘れないで。",
    )
)

print(response.text)

Output

はい、もちろんです。テーブルはこちらです。

| 品名         | 高さ     | 直径   | 価格   | その他      |
| ------------- | -------- | ----- | -------- | ------------ |
| タンブラー      | 4インチ     | 3インチ   | \$20    | #5 Artichoke ダブルディップ |
| 小鉢        | 3.5インチ | 6.5インチ | \$35    |    |
| 中鉢 | 4インチ | 7インチ | \$40  |         |
| ジェミニ ダブルディップ   |         |         |   | #6 Slow cool  |
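To use a table like this in code rather than just read it, the Markdown rows can be turned into dictionaries. A rough sketch, assuming a simple pipe table with no escaped pipes inside cells. `parse_markdown_table` is a made-up helper, and the sample is a shortened copy of the table above. It also undoes the `\$` escaping that the system instruction asked for.

```python
def parse_markdown_table(text):
    """Parse a simple Markdown pipe table into a list of dicts keyed by header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # not a table row
        cells = [cell.strip().replace("\\$", "$") for cell in line[1:-1].split("|")]
        # Skip the separator row (cells made only of '-', ':' and spaces).
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]

sample = r"""
| 品名         | 高さ     | 直径   | 価格   | その他      |
| ------------- | -------- | ----- | -------- | ------------ |
| タンブラー      | 4インチ     | 3インチ   | \$20    | #5 Artichoke ダブルディップ |
| 小鉢        | 3.5インチ | 6.5インチ | \$35    |    |
"""

items = parse_markdown_table(sample)
for item in items:
    print(item["品名"], item["価格"])
```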

Analyzing a screen-capture video. This could be used, for example, to analyze how users interact with an application.

prompt = """
このビデオを要約する段落を作成してください。対応するタイムコードとともに、3～5文にまとめてください。
"""

video = user_study_video

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[
        types.Content(
            role="user",
            parts=[
                types.Part.from_uri(
                    file_uri=video.uri,
                    mime_type=video.mime_type),
                ]),
        prompt,
    ],
)

print(response.text)

Output

もちろんです。以下にビデオの要約を示します。

この動画では、「マイガーデンアプリ」アプリの使い方を紹介しています。
まず、アプリのホームページに植物が表示されます（0:00）。
ユーザーは各植物の「いいね」ボタンを押して、その植物を「いいね」リストに保存できます（0:09）。
「カートに入れる」ボタンを押すと、植物がショッピングカートに追加されます（0:13）。
最後に、ユーザーはショッピングカートに移動し、購入したい商品をリストで確認できます（0:30）。

kun432

As for the uploaded files, I still can't tell where they are actually stored. Maybe a GCS bucket dedicated to Google AI gets created?

In any case, they can be listed like this.

from google import genai
import json

client = genai.Client(api_key=GOOGLE_API_KEY)

for file in client.files.list():
    print(file.model_dump_json(indent=2))

Output

{
  "name": "files/AAAAAAAAAA",
  "display_name": null,
  "mime_type": "video/mp4",
  "size_bytes": 10300135,
  "create_time": "2024-12-22T13:56:40.864283Z",
  "expiration_time": "2024-12-24T13:56:40.810387Z",
  "update_time": "2024-12-22T13:56:44.253843Z",
  "sha256_hash": "8f41101bdc80b80589133169e10c5ebc9e08c67b3ccd5fd84e8b0d62af0fed99",
  "uri": "https://generativelanguage.googleapis.com/v1beta/files/AAAAAAAAAA",
  "state": "ACTIVE",
  "video_metadata": {
    "videoDuration": "52s"
  },
  "error": null
}
{
  "name": "files/BBBBBBBBBB",
  "display_name": null,
  "mime_type": "video/mp4",
  "size_bytes": 54348437,
  "create_time": "2024-12-22T13:56:29.290478Z",
  "expiration_time": "2024-12-24T13:56:29.262392Z",
  "update_time": "2024-12-22T13:56:33.832613Z",
  "sha256_hash": "ce97b9b1eaba2f8013ae4d916c52c67b665d09fefa35b3cce9c997a166c72563",
  "uri": "https://generativelanguage.googleapis.com/v1beta/files/BBBBBBBBBB",
  "state": "ACTIVE",
  "video_metadata": {
    "videoDuration": "32s"
  },
  "error": null
}
{
  "name": "files/CCCCCCCCCC",
  "display_name": null,
  "mime_type": "video/mp4",
  "size_bytes": 104449611,
  "create_time": "2024-12-22T13:56:06.943739Z",
  "expiration_time": "2024-12-24T13:56:06.915367Z",
  "update_time": "2024-12-22T13:56:18.877190Z",
  "sha256_hash": "103436242b723dec21c5e886f52908e44db9d38d2b6937a5a588d9e80ac16d1c",
  "uri": "https://generativelanguage.googleapis.com/v1beta/files/CCCCCCCCCC",
  "state": "ACTIVE",
  "video_metadata": {
    "videoDuration": "310s"
  },
  "error": null
}
{
  "name": "files/DDDDDDDDDD",
  "display_name": null,
  "mime_type": "video/mp4",
  "size_bytes": 70305802,
  "create_time": "2024-12-22T13:55:53.652370Z",
  "expiration_time": "2024-12-24T13:55:53.622943Z",
  "update_time": "2024-12-22T13:55:58.981168Z",
  "sha256_hash": "4c18f71626225d371d528f15baefbfa6297cbb19e027c1104548d86259307077",
  "uri": "https://generativelanguage.googleapis.com/v1beta/files/DDDDDDDDDD",
  "state": "ACTIVE",
  "video_metadata": {
    "videoDuration": "12s"
  },
  "error": null
}
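One thing the listing does show is that each file has an `expiration_time`. Comparing it with `create_time` gives the retention period, which suggests the uploads are temporary rather than something to manage long term. A minimal sketch using the first entry's timestamps from the output above. `parse_ts` is a made-up helper.

```python
from datetime import datetime

def parse_ts(timestamp):
    """Parse an RFC 3339 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))

# Values copied from the first file in the listing above.
create_time = parse_ts("2024-12-22T13:56:40.864283Z")
expiration_time = parse_ts("2024-12-24T13:56:40.810387Z")

lifetime_hours = (expiration_time - create_time).total_seconds() / 3600
print(f"Retention: about {lifetime_hours:.1f} hours")
```

For the files listed here that works out to roughly two days between creation and expiry.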

They can be deleted like this.

client.files.delete(name="files/AAAAAAAAAA")
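To clean up everything at once instead of one name at a time, the two calls already used above (`client.files.list()` and `client.files.delete(name=...)`) can be combined. A minimal sketch: `delete_all_files` is a made-up helper, and `FakeFiles` is a hypothetical stand-in for the real client so the loop can be tried without touching actual uploads.

```python
from types import SimpleNamespace

def delete_all_files(client):
    """Delete every file returned by client.files.list(); return the deleted names."""
    deleted = []
    # Materialize the listing first so deleting does not disturb the iteration.
    for file in list(client.files.list()):
        client.files.delete(name=file.name)
        deleted.append(file.name)
    return deleted

# Hypothetical in-memory stand-in for client.files (not part of the SDK).
class FakeFiles:
    def __init__(self, names):
        self._names = [n for n in names]

    def list(self):
        return [SimpleNamespace(name=n) for n in self._names]

    def delete(self, name):
        self._names.remove(name)

fake_client = SimpleNamespace(files=FakeFiles(["files/AAAAAAAAAA", "files/BBBBBBBBBB"]))
deleted = delete_all_files(fake_client)
remaining = fake_client.files.list()
print(deleted, remaining)
```

With the real client from these notes, the call would simply be `delete_all_files(client)`.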

That said, if you're using this in a production service, it's obviously better to go with Vertex AI and manage storage and the like properly.

kun432

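Summary

It's packed with features, and my impression is that it covers everything you would expect from a recent LLM.
If the performance is good on top of that, I suspect a fair number of people will conclude that Gemini alone is enough for typical use cases.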

kun432

This scrap was closed on 2025/01/09.