Trying Image to Caption with Imagen on Vertex AI

Tanner / テナー

Searching Model Garden turned up the model I was looking for.

Looking at the use cases, it seems to handle both images and videos.

Use cases

Creators can generate captions for uploaded images and videos (for example, a short description of a video sequence)
Generate captions to describe products
Integrate Imagen captioning with an app using the API to create new experiences

Trying it in Python

import base64
import json
import requests
from google.oauth2 import service_account
from google.auth.transport.requests import Request
# Path to the service account key file
credentials = service_account.Credentials.from_service_account_file(
    './your-key.json',
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)


# Refresh the credentials so that an access token is available
credentials.refresh(Request())
access_token = credentials.token

headers = {
    'Authorization': f"Bearer {access_token}",
    'Content-Type': 'application/json; charset=utf-8'
}

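# Set the required values below
PROJECT_ID = "YOUR-PJ-ID"
# Read the image and Base64-encode it; the with block closes the file handle
with open("./your/img.png", "rb") as image_file:
    B64_IMAGE = base64.b64encode(image_file.read()).decode()
RESPONSE_COUNT = 3  # accepted integer values are 1 to 3
LANGUAGE_CODE = "en"  # one of the supported language codes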

# URL for the POST request
url = f"https://us-central1-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/us-central1/publishers/google/models/imagetext:predict"

# JSON body of the request
request_data = {
    "instances": [
        {
            "image": {
                "bytesBase64Encoded": B64_IMAGE
            }
        }
    ],
    "parameters": {
        "sampleCount": RESPONSE_COUNT,
        "language": LANGUAGE_CODE
    }
}

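# Send the POST request (the timeout keeps a stalled call from hanging forever)
response = requests.post(url, headers=headers, json=request_data, timeout=60)
# Raise an exception on HTTP error responses (4xx/5xx)
response.raise_for_status()

# Parse the response as JSON
response_data = response.json()

# Print the response
print(json.dumps(response_data, indent=2))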
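
The request body in the script could also be assembled by a small helper that checks the sample count before anything is sent. This is only a sketch: `build_caption_request` is a hypothetical name, not part of any Google library, and the 1 to 3 range is taken from the comment in the script itself.

```python
import base64


def build_caption_request(image_bytes: bytes, sample_count: int = 1,
                          language: str = "en") -> dict:
    """Build a JSON body shaped like the one used in the script (hypothetical helper)."""
    # Assumption: the accepted range is 1 to 3, as the script's comment says
    if not 1 <= sample_count <= 3:
        raise ValueError("sample_count must be between 1 and 3")
    return {
        "instances": [
            {"image": {"bytesBase64Encoded": base64.b64encode(image_bytes).decode()}}
        ],
        "parameters": {"sampleCount": sample_count, "language": language},
    }
```

Failing early on a bad `sample_count` avoids spending an API call on a request that would probably be rejected.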

When I sent the following image to the API

I got a response like the one below. Wow, it looks right.

{
  "predictions": [
    "an orange and white cat wearing a blue hat and collar",
    "a cat wearing a blue hat and collar sits on a table",
    "a cat wearing a blue hat and collar is sitting on a table"
  ],
  "deployedModelId": "XXXXXXXXXXXXX"
}
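
A minimal sketch of pulling the captions out of a response shaped like the one above. `extract_captions` is a hypothetical helper name, and the `error` branch assumes the usual Google API error envelope (`{"error": {"message": ...}}`), which the response shown here does not confirm.

```python
def extract_captions(response_data: dict) -> list[str]:
    """Return the caption strings from a predict response (hypothetical helper)."""
    # Assumption: a failed call carries an "error" object instead of "predictions"
    if "error" in response_data:
        error = response_data["error"]
        message = error.get("message", "unknown error") if isinstance(error, dict) else str(error)
        raise RuntimeError(f"Captioning request failed: {message}")
    # Keep only string entries, in the order the API returned them
    return [p for p in response_data.get("predictions", []) if isinstance(p, str)]


sample_response = {
    "predictions": [
        "an orange and white cat wearing a blue hat and collar",
        "a cat wearing a blue hat and collar sits on a table",
        "a cat wearing a blue hat and collar is sitting on a table",
    ],
    "deployedModelId": "XXXXXXXXXXXXX",
}
captions = extract_captions(sample_response)
print(captions[0])
```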

The supported languages appear to be the ones below. If you really need Japanese, putting a translation API in between should work.

English (en)
French (fr)
German (de)
Italian (it)
Spanish (es)
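
One way to handle this in code, sketched under assumptions: pick a supported caption language and flag whether a translation step is still needed afterwards. `plan_caption_language` is a hypothetical helper, the language set is copied from the list above, and the actual translation call is deliberately left out.

```python
# Language codes copied from the list above
SUPPORTED_LANGUAGES = {"en", "fr", "de", "it", "es"}


def plan_caption_language(target: str, fallback: str = "en") -> tuple[str, bool]:
    """Return (language to request, whether the caption must be translated afterwards)."""
    if target in SUPPORTED_LANGUAGES:
        return target, False
    # Unsupported target (for example "ja"): caption in the fallback, translate later
    return fallback, True


request_language, needs_translation = plan_caption_language("ja")
print(request_language, needs_translation)
```

When `needs_translation` is true, the caption would then be passed to whichever translation service is in use.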

Processing a single image (1024 x 1024, 1.8 MB) took roughly 15 seconds.
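
A timing like this can be reproduced with a small wrapper around the call. This is a sketch, not how the figure above was actually measured; `timed` is a hypothetical helper.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# With the script above this might look like (left as a comment, since it needs credentials):
# response, seconds = timed(requests.post, url, headers=headers, json=request_data)
result, seconds = timed(sum, [1, 2, 3])
print(f"{seconds:.6f} s")
```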

According to the official documentation, the cost is $0.0015 per image, which is about 0.2 yen.
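
The arithmetic scales linearly with the number of images. A rough sketch, where the exchange rate of 150 yen per dollar is my own assumption rather than a figure from the documentation:

```python
PRICE_PER_IMAGE_USD = 0.0015  # per-image price quoted above
ASSUMED_JPY_PER_USD = 150.0   # assumption: illustrative exchange rate only


def estimate_cost(num_images: int, jpy_per_usd: float = ASSUMED_JPY_PER_USD) -> tuple[float, float]:
    """Return the estimated cost as (USD, JPY) for num_images captioning calls."""
    usd = num_images * PRICE_PER_IMAGE_USD
    return usd, usd * jpy_per_usd


usd, jpy = estimate_cost(1000)
print(f"1000 images: ${usd:.2f} (about {jpy:.0f} yen)")
```

At that assumed rate a single image comes to 0.225 yen, in line with the rough figure of 0.2 yen.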

With the BLIP2 model, it looks like you can do both image captioning and VQA (question answering about an image).

I wrote this up as an article.