📱

GPT-4oのマルチモーダル入力をiOSから試す

2024/05/23に公開

2024年5月21日に開催されたイベントで発表した内容です。GPT-4oのマルチモーダル機能を、iOSからAPIをたたいて試してみた、という話。

リアルタイムにGPT-4oに動画を理解してもらうアプリを実装し、デモも行いました：

https://youtu.be/bF5CW3b47Ss

またデモ・各種サンプルコードは以下で公開しています。

スライド・発表動画

スライドはこちら：

発表動画 ^[1] がこちら：

以下、発表資料を記事として再構成したものになります。登壇後に調査した内容も追記しています。

GPT-4oと「マルチモーダル」

GPT-4oのモダリティ

2024.5.13 GPT-4o発表

Hello GPT-4o | OpenAI の1行目：

We’re announcing GPT-4o, our new flagship model that can reason across audio, vision, and text in real time.

→ GPT-4oの肝はマルチモーダル

GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs.

入力：テキスト・音声・画像・動画のあらゆる組み合わせ
出力：テキスト・音声・画像のあらゆる組み合わせ

End-to-end

With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.

こうだったのが：

音声 → [Whisper] → テキスト → [GPT] → テキスト → [TTS] → 音声

こうなる：

音声 → [GPT] → 音声

「マルチモーダルコミュニケーション」の主戦場といえば・・・

スマホ

いつでもどこでも使える
みんな持ってる
マイク・カメラ完備
（テキスト入力めんどくさい）

→ 本発表のテーマ： モバイルアプリ ⇔ GPT のマルチモーダルコミュニケーションに備えよう

もうAPIは対応しているのか？

Chat Completion API対応状況

モデルにGPT-4oを指定可能
- add gpt-4o model (#1417) · openai/openai-python
各種モダリティのAPI対応状況

モダリティ	入力	出力
テキスト	◯	◯
画像	◯	✕
動画	△（後述）	-
音声	✕	✕

Chat Completion APIに画像・動画を投げる方法

API Reference - OpenAI API

画像： "image_url" に画像データやURLを入れる
動画：フレーム画像を複数入れる

"content": [
  {
    "type": "image_url",
    "image_url": ...
  }
]

→ 動画理解は可能だが、動画というモダリティに完全に対応しているとは言い難い

To-Be:

動画 → [GPT] → 出力

As-Is:

動画 → 複数の画像 → [GPT] → 出力

End-to-endで学習されたGPT-4oのモダリティを活かせていない
クライアントサイドで動画のデコード処理のオーバーヘッド
動画フォーマットをデコードして送るためデータ量も増える
音声情報も使用されない

（補足）Chat Completion APIのVision機能について

GPT-4 Turbo with Visionモデルがリリースされた2023年11月にAPIが追加
同月、OpenAIのAPIドキュメントにVision機能の使い方が追加
- 動画を個々のフレーム画像に分割してモデルに入力することで動画の内容理解ができると説明されている

→ つまりこのへんはGPT-4oの新機能ではない

GPT-4oのマルチモーダル入力をiOSから使う

iOSで使えるOpenAI APIクライアント

いろいろあるが、現段階では MacPaw/OpenAI でよさそう

スター数がもっとも多い（≒利用実績）
直近でもメンテされている（GPT-4o対応のRelease）
依存ライブラリなし
ストリーミングAPI、画像生成、TTS等もサポート

`MacPaw/OpenAI` は Vision API をサポートしているのか？

READMEには何も書いてないが、ソースコードを読むと実はサポートしている

ChatCompletionContentPartImageParam (ChatQuery.ChatCompletionMessageParam.ChatCompletionUserMessageParam.Content.VisionContent.ChatCompletionContentPartImageParam) の定義を見てみると、"image_url" に相当する実装がある：

public struct ChatCompletionContentPartImageParam: Codable, Equatable {
    public let imageUrl: ImageURL
    /// The type of the content part.
    public let type: String

    public init(imageUrl: ImageURL) {
        self.imageUrl = imageUrl
        self.type = "image_url"
    }
    
    ...
}

そして ImageURL 型（ChatQuery.ChatCompletionMessageParam.ChatCompletionUserMessageParam.Content.VisionContent.ChatCompletionContentPartImageParam.ImageURL）の定義を見てみると、ちゃんと画像URL or 画像データ ^[2] どちらでも渡せるようになっている。

public struct ImageURL: Codable, Equatable {
    /// Either a URL of the image or the base64 encoded image data.
    public let url: String
    /// Specifies the detail level of the image. Learn more in the
    /// Vision guide https://platform.openai.com/docs/guides/vision/low-or-high-fidelity-image-understanding
    public let detail: Detail

    public init(url: String, detail: Detail) {
        self.url = url
        self.detail = detail
    }

    public init(url: Data, detail: Detail) {
        self.init(
            url: "data:image/jpeg;base64,\(url.base64EncodedString())",
            detail: detail)
    }
    
    ...
}

詳細はこちら：
https://note.com/shu223/n/n1ad2e13b79ef

実装して試してみた

iOS×GPT-4oで画像理解

iOS×GPT-4oで画像理解

プロンプト： "What's in this image?"
入力画像はスクショにあるGPT-4oデモ動画の1シーン

出力

This image shows a person seated at a table, using a smartphone. The person is wearing a hoodie with "OpenAl" printed on it.

The image is divided into two parts:

The left side shows the person sitting at the table, looking at the smartphone, with a mug on the table that also has an OpenAl logo.
The right side is a close-up of the person's hand holding the smartphone, displaying the front screen where the person is visible, suggesting that the person is taking a selfie or using the front camera.

In the background, there is a lamp and some...（後略）

コード抜粋

VisionContent 配列を作成するメソッドの定義

private static func buildVisionContents(withImage imageData: Data, text: String) -> [Content.VisionContent] {
    return [
        .init(chatCompletionContentPartTextParam: .init(text: text)),
        .init(chatCompletionContentPartImageParam: .init(imageUrl: .init(url: imageData, detail: .auto)))
    ]
}

画像理解のためのユーザーロールの message の作成

let messages: [ChatQuery.ChatCompletionMessageParam] = [
    .init(role: .user, content: OpenAIClient.buildVisionContents(withImage: imageData, text: text))!
]

ChatQuery の送信

let query = ChatQuery(messages: messages, model: .gpt4_o, maxTokens: maxTokens)
let result = try await openAI.chats(query: query)

iOS×GPT-4oで動画の要約

プロンプト："動画の要約を提供してください。"
60sの動画から5秒おきにフレーム抽出／長辺768pxになるようリサイズ

iOS×GPT-4oで動画の要約

出力

（上のスクショとは別の試行時における出力例です）
この動画は、2021年のAppleのWWDC（Worldwide Developers Conference）での発表を示すフレームです。

最初のフレームは、青いライトで照らされた3人のキャラクターを示しており、Appleのアバター風に見えます。
次のフレームは、「Apple WWDC21」のロゴといくつかのアイコンがMacBookの画面に表示されている様子を示しています。
以降のフレームは、ある人物がプレゼンテーションを行っている様子を示しており、バックグラウンドにはデスクトップコンピュータや他の機器が見えます。
フレームには「Minimum focus distance」、「10-bit HDR video」、「Video Effects in Control Center」、「Performance best practices」、「IOSurface compression」などのテキストが表示され、特定の技術的なトピックスについて説明していることが分かります。
最後のフレームには「AVFoundation capture classes」と表示されています。

このプレゼンテーションは、カメラ技術やビデオエフェクト、パフォーマンスのベストプラクティスに関する技術的な詳細を開発者に解説する内容です。

フレームを抽出するインターバルや、画像サイズを決定する際にはOpenAIのガイドも参考にした：

"invalid_request_error"

画像の枚数が多い、あるいは1枚1枚の画像サイズが大きいと実行時に次のようなエラーになる：

OpenAI.APIError(message: "You uploaded an unsupported image. Please make sure your image is below 20 MB in size and is of one the following formats: ['png', 'jpeg', 'gif', 'webp'].", type: "invalid_request_error", param: nil, code: Optional("sanitizer_server_error"))

エラーメッセージでは1枚あたり20MBまで、と読めるし、ドキュメントでもそのように書いてあるのだが、どうやら複数画像を投げる際は トータルで20MBまで ということらしい。

コード抜粋

複数画像からVisionContent 配列を作成するメソッドの定義

private static func buildVisionContents(withImages images: [Data], text: String) -> [Content.VisionContent] {
    var visionContents: [Content.VisionContent] = [
        .init(chatCompletionContentPartTextParam: .init(text: text)),
    ]
    images.forEach { data in
        visionContents.append(
            .init(chatCompletionContentPartImageParam: .init(imageUrl: .init(url: data, detail: .auto)))
        )
    }
    return visionContents
}

動画から一定間隔でフレームを抽出して VisionContent 配列を作成するメソッドの定義

private static func buildVisionContents(withVideo videoURL: URL, text: String) async throws -> [Content.VisionContent] {
    // 1920x1080 -> 768x432
    let images = try await VideoUtils.extractFrames(from: videoURL, timeInterval: timeInterval, maximumSize: maxSize).map { $0.data! }
    return buildVisionContents(withImages: images, text: text)
}

動画要約のためのユーザーロールの message の作成

let visionContents = try await OpenAIClient.buildVisionContents(withVideo: videoURL, text: "これらは動画のフレームです。")
let messages: [Message] = [
    .init(role: .system, content: "動画の要約を生成しています。動画の要約を提供してください。Markdownで応答してください。")!,
    .init(role: .user, content: visionContents)!
]

iOS×GPT-4oでリアルタイム動画理解

デモ：
https://youtu.be/bF5CW3b47Ss

前フレームの説明テキストも同梱
レスポンス改善のためのポイントは後述

各サンプルのソースコード

こちらで公開しています：

ぜひビルドしてお手元でお試しください。お役に立ちましたらStarしていただけると嬉しいです。

「リアルタイム動画理解」レスポンス改善のポイント

画像理解の忠実度 / detail パラメータ

OpenAIのChat Completion APIに "detail" というパラメータがある：

messages=[
  {
    "role": "user",
    "content": [
      {"type": "text", "text": "What’s in this image?"},
      {
        "type": "image_url",
        "image_url": {
          "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
          "detail": "high"
        },
      },
    ],
  }
],

iOSからリアルタイムに動画理解させるサンプルでは、"detail" に "low" を指定：

visionContents.append(
    .init(chatCompletionContentPartImageParam: .init(imageUrl: .init(url: data, detail: .low)))
)

公式ドキュメントの記述抜粋：

Low or high fidelity image understanding （低忠実度または高忠実度の画像理解）

By controlling the detail parameter, which has three options, low, high, or auto, you have control over how the model processes the image and generates its textual understanding. By default, the model will use the auto setting which will look at the image input size and decide if it should use the low or high setting.
（low、high、autoの3つのオプションがあるdetailパラメータを制御することで、モデルがどのように画像を処理し、テキスト理解を生成するかを制御できます。デフォルトでは、モデルはauto設定を使用し、画像の入力サイズを見て、low設定とhigh設定のどちらを使用するかを決定します。）

low will enable the "low res" mode. The model will receive a low-res 512px x 512px version of the image, and represent the image with a budget of 65 tokens. This allows the API to return faster responses and consume fewer input tokens for use cases that do not require high detail.
（lowは "低解像度 "モードを有効にします。モデルは画像の低解像度バージョン512px x 512pxを受け取り、65トークンの予算で画像を表現します。これにより、APIはより高速なレスポンスを返すことができ、高いディテールを必要としないユースケースではより少ない入力トークンを消費します。）

フレーム画像のリサイズ & クロップ

上述のdetailの仕様を鑑み、また前述の20MB制限もあるので、必要最小限の画像を送信するために512x512にリサイズ&クロップしている。

リサイズ・クロップはCore Imageを使用：

func transformed(by matrix: CGAffineTransform) -> CIImage

func cropped(to rect: CGRect) -> CIImage

JPEGエンコード

画像をData型にする際に、当初は

let imageData = cgImage.dataProvider?.data as? Data

という実装にしていた。この場合、1920x1080であれば1フレームあたり 8294400 bytes = 8.29 MBもある。

これだと3フレームも送れば前述の 20 MB制限に引っかかってエラーになる。

ので、以下のようにJPEGエンコードしてから送信するようにした：

let imageData = UIImage(cgImage: cgImage).jpegData(compressionQuality: 0.8)

こうすると1920x1080で1フレームあたり 52868 bytes = 0.53 MB （※画像の内容により変動はする）

compressionQuality を下げればさらにデータは小さくなる。動画理解のサンプルでは最終的に compressionQuality は 0.3 にした。

その他

その他レスポンス改善のためのポイントや試行錯誤の過程はこちらにまとめた：

https://note.com/shu223/n/ncef4ea64bbed

ストリーミング入出力に備える

モダリティ	入力	出力
テキスト	◯	◯
画像	◯	✕
動画	ここ	-
音声	ここ	ここ

このへんをヒントに備えておくとよさそう？

Whisper API・・・音声入力API
TTS API・・・音声出力API

Whisper API（音声入力）

ストリーミング入力はサポートしてない
「音声をチャンクごとに送る」方式でストリーミングを実現しているサードパーティー実装はいろいろある

TTS API（音声出力）

ストリーミング出力をサポートしている
- ドキュメント： Text to speech - OpenAI API
- Python実装
残念ながら MacPaw/OpenAI ではストリーム出力はサポートしていなかった（ソース読んで確認）^[3]

新APIのリリースに気付くには

公式Pythonライブラリのreleaseをwatchする

まとめ

GPT-4oの「マルチモーダル」はスマホで大活躍しそう
OpenAI APIの現在の対応状況
iOSでの「GPT-4oを用いた画像理解・動画理解」の実装
ストリーミング入出力対応に備えよう

脚注

イベントでの発表を録画したものではなく、イベント終了後に改めてひとりでしゃべって撮った動画です。 ↩︎
API仕様では画像データはBase64エンコードして渡す必要があるのだが、MacPaw/OpenAI では呼び出し側でBase64エンコーディングのことまで考えなくても、Data型で渡せばライブラリ内でエンコーディングしてくれる実装になっている。 ↩︎
Contributionチャンス！ ↩︎

Discussion

ログインするとコメントできます

スライド・発表動画

GPT-4oと「マルチモーダル」

GPT-4oのモダリティ

End-to-end

「マルチモーダルコミュニケーション」の主戦場といえば・・・

もうAPIは対応しているのか？

Chat Completion API対応状況

Chat Completion APIに画像・動画を投げる方法

→ 動画理解は可能だが、動画というモダリティに完全に対応しているとは言い難い

（補足）Chat Completion APIのVision機能について

GPT-4oのマルチモーダル入力をiOSから使う

iOSで使えるOpenAI APIクライアント

MacPaw/OpenAI は Vision API をサポートしているのか？

実装して試してみた

iOS×GPT-4oで画像理解

コード抜粋

iOS×GPT-4oで動画の要約

"invalid_request_error"

コード抜粋

iOS×GPT-4oでリアルタイム動画理解

各サンプルのソースコード

「リアルタイム動画理解」レスポンス改善のポイント

画像理解の忠実度 / detail パラメータ

フレーム画像のリサイズ & クロップ

JPEGエンコード

その他

ストリーミング入出力に備える

このへんをヒントに備えておくとよさそう？

Whisper API（音声入力）

TTS API（音声出力）

新APIのリリースに気付くには

まとめ

Discussion

`MacPaw/OpenAI` は Vision API をサポートしているのか？