Trying the Qwen3-VL Cookbooks (1): Omni Recognition

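Introducing the Qwen3-VL Cookbooks! 🧑‍🍳
A curated collection of notebooks showcasing what Qwen3-VL can do, through both local deployment and the API, across a range of multimodal use cases:
✅ Thinking with Images
✅ Computer-Use Agent
✅ Multimodal Coding
✅ Omni Recognition
✅ Advanced Document Parsing
✅ Precise Object Grounding (in any format!)
✅ General OCR & Key Info Extraction
✅ 3D Grounding
✅ Long Document Understanding
✅ Spatial Reasoning
✅ Mobile Agent
✅ Video Understanding
🔗 Link: https://github.com/QwenLM/Qwen3-VL/tree/main/cookbooks
⚙️ API: https://alibabacloud.com/help/en/model-studio/user-guide/vision/
💬 Qwen Chat: https://chat.qwen.ai/?models=qwen3-vl-plus
Explore the cookbooks, experiment, and let's build the future of multimodal AI together! 🚀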
https://x.com/Alibaba_Qwen/status/1976479304814145877
(Translation by PLaMo Translate)
The notebooks themselves are in the GitHub repository.
https://github.com/QwenLM/Qwen3-VL/tree/main/cookbooks
According to the README, the notebooks cover the following.

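Cookbooks

We are preparing cookbooks for many capabilities, including recognition, localization, document parsing, video understanding, and key information extraction. They cover identifying animals, plants, people, and scenic spots, as well as recognizing a wide range of objects such as cars and products. Take a look for the details!

| Cookbook | Description |
| --- | --- |
| Omni Recognition | Identifies animals, plants, people, and scenic spots, and also recognizes various objects such as cars and products |
| Powerful Document Parsing Capabilities | More advanced document parsing that covers not only text but also layout position information and the Qwen HTML format |
| Precise Object Grounding Across Formats | Uses relative position coordinates and supports both boxes and points, allowing diverse positioning and labeling tasks |
| General OCR and Key Information Extraction | Stronger text recognition in natural scenes and in multiple languages, supporting diverse key information extraction needs |
| Video Understanding | Better video OCR, long video understanding, and video grounding |
| Mobile Agent | Locates and reasons for mobile phone control |
| Computer-Use Agent | Locates and reasons for controlling computers and the Web |
| 3D Grounding | Produces accurate 3D bounding boxes for both indoor and outdoor objects |
| Thinking with Images | Uses image_zoom_in_tool and search_tool to help the model precisely understand fine-grained visual details in images |
| Multimodal Coding | Generates accurate code based on a rigorous understanding of multimodal information |
| Long Document Understanding | Rigorous semantic understanding of ultra-long documents |
| Spatial Understanding | Sees, understands, and reasons about spatial information |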

To get a feel for how far this can go, I'll try them one by one.
The first is "Omni Recognition".
https://colab.research.google.com/github/QwenLM/Qwen3-VL/blob/main/cookbooks/omni_recognition.ipynb


kun432

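So I was about to try it, but from a quick look at the notebook:
- It assumes "qwen3-vl-235b-a22b-instruct" hosted on DashScope, Alibaba Cloud's platform
- It accesses that model through the OpenAI Python SDK

In other words, it is not set up to run a local model with Transformers (there is a little code for that, but it is incomplete, or rather not implemented at all).
So instead I'll reuse code I wrote earlier when trying Qwen3-VL-4B-Instruct, which takes the model's recommended parameters into account, and run it on a Colaboratory L4.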
https://zenn.dev/kun432/scraps/b2cb6e607969c0
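
For reference, the hosted route the notebook takes could look roughly like the sketch below. The base URL, the `DASHSCOPE_API_KEY` environment variable name, and the helper names are assumptions for illustration and are not copied from the notebook; check the Model Studio documentation for the endpoint of your region.

```python
import base64
import mimetypes
import os


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime or 'image/jpeg'};base64,{encoded}"


def build_messages(image_path: str, prompt: str) -> list:
    """Build an OpenAI-style chat message with one image part and one text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def ask_dashscope(image_path: str, prompt: str) -> str:
    """Send one image and one prompt to the hosted model (needs network access and an API key)."""
    from openai import OpenAI  # imported lazily so the helpers above work without it

    client = OpenAI(
        api_key=os.environ["DASHSCOPE_API_KEY"],  # assumed variable name
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
    )
    completion = client.chat.completions.create(
        model="qwen3-vl-235b-a22b-instruct",
        messages=build_messages(image_path, prompt),
    )
    return completion.choices[0].message.content
```

The rest of this scrap does not use this route; everything below runs a local model through Transformers.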

kun432

Install the packages.

!pip install -U transformers
!pip install qwen-vl-utils
!pip install flash-attn --no-build-isolation

!pip freeze | egrep -i "^(transformers|qwen-vl-utils|flash_attn)"

Output

flash_attn==2.8.3
qwen-vl-utils==0.0.14
transformers==4.57.1

Qwen3-VL comes in 2B / 4B / 8B / 32B / 30B-A3B / 235B-A22B parameter sizes, each with Instruct and Thinking variants. This time I'll use 8B-Instruct, which just fits in the VRAM of a Colaboratory L4 (23034 MiB, about 22.5 GiB).

Load the model and the processor.

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model_path = "Qwen/Qwen3-VL-8B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
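
Building flash-attn can take a while or fail on some runtimes. As a hedged sketch, the attention backend could be chosen at runtime instead of being hard-coded. The helper name `pick_attn_implementation` is hypothetical; `"flash_attention_2"` and `"sdpa"` are values Transformers accepts for `attn_implementation`. Being importable does not guarantee that the GPU supports FlashAttention, so treat this as a convenience check only.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if the flash_attn package is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The result can then be passed as `attn_implementation=attn_implementation` in the `from_pretrained` call above.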

VRAM usage comes to about 17 GB.

Output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   43C    P0             27W /   72W |   16953MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
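
As a rough sanity check on that figure, the sketch below estimates the memory taken by the weights alone. It assumes 2 bytes per parameter (bf16, which `dtype="auto"` is assumed to select here) and a nominal 8e9 parameters taken from the model name; the helper `weights_gib` is hypothetical.

```python
def weights_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


estimate_gib = weights_gib(8e9)   # nominal 8B parameters in bf16 (assumption)
measured_gib = 16953 / 1024       # "Memory-Usage" from nvidia-smi above, MiB -> GiB
total_gib = 23034 / 1024          # total VRAM reported by nvidia-smi, MiB -> GiB

print(f"estimated weights: {estimate_gib:.1f} GiB")  # 14.9 GiB
print(f"measured usage:    {measured_gib:.1f} GiB")  # 16.6 GiB
print(f"total VRAM:        {total_gib:.1f} GiB")     # 22.5 GiB
```

The gap between the estimate and the measured value would plausibly come from parameters beyond the nominal 8B (for example the vision encoder) plus CUDA and framework overhead, though that breakdown was not measured here.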

Define a function for inference. Image and video preprocessing goes through qwen-vl-utils, though I don't fully understand it yet.

import torch
from transformers import (
    LogitsProcessor,
    LogitsProcessorList,
)
from qwen_vl_utils import process_vision_info


class PresencePenaltyProcessor(LogitsProcessor):
    """
    Apply a presence penalty: discourage generating tokens that have already appeared
    (presence-based, not frequency-based).
    This mimics OpenAI-style presence_penalty in a simple way by subtracting a fixed
    penalty from the logits of any token that occurs at least once in input_ids.
    Note: during generate(), input_ids holds the prompt followed by the generated
    tokens, so prompt tokens are penalized as well.
    """
    def __init__(self, presence_penalty: float):
        super().__init__()
        if presence_penalty < 0:
            raise ValueError("presence_penalty must be >= 0.")
        self.presence_penalty = presence_penalty

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # input_ids shape: (batch, cur_len)
        # scores shape: (batch, vocab_size)
        batch_size = input_ids.shape[0]
        for b in range(batch_size):
            seen = set(input_ids[b].tolist())
            if len(seen) == 0:
                continue
            # Subtract penalty from logits of seen tokens
            # Note: scores[b] is (vocab_size,)
            # Efficient masking
            indices = torch.tensor(list(seen), device=scores.device, dtype=torch.long)
            # Clamp indices to valid range just in case
            indices = indices[(indices >= 0) & (indices < scores.shape[-1])]
            if indices.numel() > 0:
                scores[b, indices] -= self.presence_penalty
        return scores


def inference(
        messages,
        max_new_tokens=16384,
        do_sample=True,
        top_p=0.8,
        top_k=20,
        temperature=0.7,
        repetition_penalty=1.0,
        presence_penalty=1.5
    ):
    """
    Generates a response from the Qwen3-VL model based on the provided messages and generation options.

    Args:
        messages (list): A list of message dictionaries in the expected format.
        max_new_tokens (int): The maximum number of new tokens to generate.
        do_sample (bool): Whether to use sampling.
        top_p (float): The cumulative probability for top-p sampling.
        top_k (int): The number of highest probability vocabulary tokens to keep for top-k sampling.
        temperature (float): The temperature for sampling.
        repetition_penalty (float): The penalty for repeating tokens.
        presence_penalty (float): The penalty for tokens that have already appeared.

    Returns:
        str: The generated text response.
    """
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    images, videos, video_kwargs = process_vision_info(
        messages,
        image_patch_size=16,
        return_video_kwargs=True,
        return_video_metadata=True
    )

    if videos is not None:
        videos, video_metadatas = zip(*videos)
        videos, video_metadatas = list(videos), list(video_metadatas)
    else:
        video_metadatas = None

    inputs = processor(
        text=text,
        images=images,
        videos=videos,
        video_metadata=video_metadatas,
        return_tensors="pt",
        do_resize=False,
        **video_kwargs
    )
    inputs = inputs.to(model.device)

    logits_processors = LogitsProcessorList()
    if presence_penalty and presence_penalty > 0:
        logits_processors.append(PresencePenaltyProcessor(presence_penalty))

    generated_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        top_p=top_p,
        top_k=top_k,
        temperature=temperature,
        repetition_penalty=repetition_penalty,
        logits_processor=logits_processors,
    )

    generated_ids_trimmed = [
        out_ids[len(in_ids) :]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]

    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )
    return output_text[0]
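
To make the presence penalty in `PresencePenaltyProcessor` concrete, here is a minimal pure-Python sketch on toy numbers. The function `apply_presence_penalty` and its `prompt_len` parameter are hypothetical. As far as I can tell, the class above counts every token in `input_ids`, prompt included; the `prompt_len` variant shows what counting only generated tokens would look like.

```python
def apply_presence_penalty(logits, token_ids, penalty, prompt_len=0):
    """Return a copy of logits with `penalty` subtracted once from every token id
    that appears in token_ids[prompt_len:], no matter how often it appears."""
    seen = set(token_ids[prompt_len:])
    return [
        score - penalty if idx in seen else score
        for idx, score in enumerate(logits)
    ]


logits = [2.0, 1.0, 0.5, 0.0]   # toy vocabulary of 4 tokens
token_ids = [3, 3, 1, 1, 1]     # pretend the first two tokens are the prompt

# Counting every token: ids 1 and 3 are penalized once each.
print(apply_presence_penalty(logits, token_ids, 1.5))
# -> [2.0, -0.5, 0.5, -1.5]

# Counting only generated tokens: only id 1 is penalized.
print(apply_presence_penalty(logits, token_ids, 1.5, prompt_len=2))
# -> [2.0, -0.5, 0.5, 0.0]
```

Token 1 appears three times but is penalized only once, which is what separates a presence penalty from a frequency penalty.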

The `inference` function can then be used like this.

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "text",
                "text": "この画像について説明して。"  # "Describe this image."
            },
        ],
    }
]

response = inference(messages)
print(response)

Result

この画像は、穏やかな夕暮れ時のビーチで、女性と犬が楽しそうに遊んでいる様子を捉えたものです。

主な描写：

人物とペット：

女性は、チェック柄のシャツと黒いパンツを着用し、砂浜に座っています。

彼女は笑顔で、手を広げて前足を上げた犬と「ハイタッチ」をしているように見えます。

犬は黄色いラブラドール・レトリバーで、カラフルなハーネスをつけています。前足を上げて女性と触れ合っており、非常に親しみやすい姿勢です。

背景と環境：

背景には穏やかな海と、夕日が沈む空が広がっています。

夕日の光が柔らかく、全体に温かみのあるオレンジ色のトーンを与えています。

海の波が優しく岸辺に打ち寄せているのが見えます。

雰囲気：

画像全体からは、平和で心地よい、人間とペットの絆を感じさせる穏やかな瞬間が伝わってきます。

ライトの演出により、ドラマチックかつロマンティックな印象も加わっています。

この画像は、ペットとの楽しい時間や、自然の中で過ごすリラックスした日常を表現しており、視覚的にも心地よい印象を与える作品です。

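Looks like it works: the response describes, in Japanese, a woman and a dog sharing a high five on a beach at sunset.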

kun432

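Now, finally, the notebook itself. What is "omni recognition" in the first place? The notebook opens with the following.

Omni Recognition with Qwen3-VL

By improving the quality and diversity of the pretraining data, the model can now recognize an extremely wide range of targets, from common objects such as celebrities, anime characters, products, and landmarks to animals and plants, covering both everyday and professional "recognize anything" needs.

This notebook demonstrates how to use Qwen3-VL for omni recognition. It takes an image and a query as input and shows how the model interprets the user's question about the objects in the image.

In short, it can recognize all sorts of things.

Fetch the sample images used in the cookbook.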

!git clone https://github.com/QwenLM/Qwen3-VL
!cp -pir Qwen3-VL/cookbooks/assets .

The first one. This is the image.

from IPython.display import Image

Image("assets/omni_recognition/sample-celebrity-2.jpg", width=800)

Apparently a Chinese actor. Maybe a scene from a TV show? There is some text overlaid as well.

Inference

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "assets/omni_recognition/sample-celebrity-2.jpg",
            },
            {
                "type": "text",
                "text": "これは誰？"  # "Who is this?"
            },
        ],
    }
]

response = inference(messages)
print(response)

これは**彭昱暢（Peng Yuchang）**です。

中国の俳優で、2017年の映画『団圓』で注目を集め、その後『動物園奇遇記』『過春天』『長江七號：超級星爸》など多くの作品で活躍しています。また、ドラマ『大主宰』や『長安十二时辰』にも出演しており、演技力と若々しい魅力で人気を集めています。

画像では、彼が「端庄华贵」という文字と共に、フォーマルなスーツ姿で膝をついている様子が描かれており、コメディー的な演出やキャラクターとしてのユーモアが感じられます。「???」や「砰」といった漫画風の効果も加わっており、おそらくバラエティ番組や映画のシーンからのカットである可能性が高いです。

Next image.

This one contains several dishes.

Have the model output the bounding box coordinates and the name of each dish as JSON.

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "assets/omni_recognition/sample-food.jpeg",
            },
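            # Prompt: "Identify each dish in the image and return their bounding boxes with Japanese and English names as a JSON string."
            {"type": "text", "text": "画像内の各料理を識別し、それらのバウンディングボックスと日本語名・英語名をJSON文字列で返して。"},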
        ],
    }
]

response = inference(messages)
print(response)

Output

```json
[
    {"bbox_2d": [0, 195, 238, 367], "label": "白いご飯", "english": "white rice"},
    {"bbox_2d": [0, 590, 240, 842], "label": "白いご飯", "english": "white rice"},
    {"bbox_2d": [233, 93, 712, 374], "label": "スープ", "english": "soup"},
    {"bbox_2d": [518, 315, 998, 603], "label": "トマトと卵の炒め物", "english": "tomato and egg stir-fry"},
    {"bbox_2d": [235, 557, 799, 962], "label": "鶏肉と血の煮込み", "english": "chicken and blood stew"},
    {"bbox_2d": [0, 331, 470, 619], "label": "白菜と豚肉の炒め物", "english": "stir-fried cabbage with pork"},
    {"bbox_2d": [686, 129, 1000, 330], "label": "オレンジのスライス", "english": "orange slices"}
]
```
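
Since the model wraps its JSON in a Markdown code fence, the response needs a little cleanup before `json.loads` can parse it. A minimal sketch, assuming the response contains at most one fenced block; `extract_json` and `FENCE` are hypothetical names, and the sample string is a single entry copied from the output above.

```python
import json
import re

FENCE = "`" * 3  # built this way so the literal fence does not appear in the source
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(response: str):
    """Parse JSON from a model response, with or without a Markdown code fence."""
    match = FENCED_BLOCK.search(response)
    payload = match.group(1) if match else response
    return json.loads(payload.strip())


sample = (
    FENCE + "json\n"
    '[{"bbox_2d": [0, 195, 238, 367], "label": "白いご飯", "english": "white rice"}]\n'
    + FENCE
)
detections = extract_json(sample)
print(detections[0]["english"], detections[0]["bbox_2d"])
```

With something like this, `extract_json(response)` could be called on the string returned by `inference` directly instead of pasting the JSON by hand.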

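The bounding box coordinates appear to be normalized to a 0-1000 range, so they have to be scaled to the image size before drawing. I had Gemini write the drawing code.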

import json
from PIL import Image, ImageDraw, ImageFont

image_path = "assets/omni_recognition/sample-food.jpeg"
image = Image.open(image_path).convert("RGB")
image_width, image_height = image.size

# Parse the JSON response (assuming the last response was the JSON)
# In a real scenario, you would get this from the model's output
json_response = """
[
    {"bbox_2d": [0, 195, 238, 367], "label": "白いご飯", "english": "white rice"},
    {"bbox_2d": [0, 590, 240, 842], "label": "白いご飯", "english": "white rice"},
    {"bbox_2d": [233, 93, 712, 374], "label": "スープ", "english": "soup"},
    {"bbox_2d": [518, 315, 998, 603], "label": "トマトと卵の炒め物", "english": "tomato and egg stir-fry"},
    {"bbox_2d": [235, 557, 799, 962], "label": "鶏肉と血の煮込み", "english": "chicken and blood stew"},
    {"bbox_2d": [0, 331, 470, 619], "label": "白菜と豚肉の炒め物", "english": "stir-fried cabbage with pork"},
    {"bbox_2d": [686, 129, 1000, 330], "label": "オレンジのスライス", "english": "orange slices"}
]
"""

detections = json.loads(json_response)

# Draw bounding boxes and labels on the image
draw = ImageDraw.Draw(image)
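try:
    # Try to use a font if available.
    # Note: Liberation Sans may lack Japanese glyphs, so the Japanese labels can render
    # as placeholder boxes; point this at a CJK font if one is installed.
    font = ImageFont.truetype("/usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf", 20)
except IOError:
    font = ImageFont.load_default()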

for detection in detections:
    bbox_normalized = detection["bbox_2d"]
    label = detection["label"]

    # Scale the normalized bounding box coordinates to the actual image size
    x_min = bbox_normalized[0] * image_width / 1000
    y_min = bbox_normalized[1] * image_height / 1000
    x_max = bbox_normalized[2] * image_width / 1000
    y_max = bbox_normalized[3] * image_height / 1000
    bbox_scaled = [x_min, y_min, x_max, y_max]

    # Draw the rectangle
    draw.rectangle(bbox_scaled, outline="red", width=3)
    # Draw the label
    text_position = (x_min, y_min - 20) # Position the text slightly above the box
    draw.text(text_position, label, fill="red", font=font)

# Display the image with bounding boxes
from IPython.display import display
display(image)
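
The scaling step in the loop above is plain proportional arithmetic: each coordinate on the 0-1000 grid is multiplied by the image width or height and divided by 1000. A small sketch with hypothetical helper names, using a made-up 1280x960 image size purely to make the numbers concrete (the real size of the sample image is not stated here):

```python
def to_pixels(bbox, width, height, scale=1000):
    """Convert [x_min, y_min, x_max, y_max] from a 0..scale grid to integer pixel coordinates."""
    x_min, y_min, x_max, y_max = bbox
    return [
        round(x_min * width / scale),
        round(y_min * height / scale),
        round(x_max * width / scale),
        round(y_max * height / scale),
    ]


def label_anchor(x_min, y_min, text_height=20):
    """Place the label just above the box, clamped so it never leaves the top of the image."""
    return (x_min, max(0, y_min - text_height))


# The "soup" box from the output above, on an assumed 1280x960 image.
box = to_pixels([233, 93, 712, 374], 1280, 960)
print(box)                           # [298, 89, 911, 359]
print(label_anchor(box[0], box[1]))  # (298, 69)
```

The clamp in `label_anchor` only matters for boxes that start within 20 pixels of the top edge, where `y_min - 20` would otherwise go negative.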

Last image.

from IPython.display import Image

Image("assets/omni_recognition/sample-anime.jpeg", width=700)

You know the one. Fan art, maybe?

Again, have the model output bounding boxes and labels.

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "assets/omni_recognition/sample-anime.jpeg",
            },
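            {
                "type": "text",
                # Prompt: "Who are the anime characters in the image? Output the bounding boxes of all characters and their names in Japanese and English as JSON."
                "text": "画像内のアニメキャラクターは誰？すべてのキャラクターのバウンディングボックスと、日本語および英語での名前をJSON形式で出力して。"
            },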
        ],
    }
]

response = inference(messages)
print(response)

Output

```json
[
  {
    "name": "富岡義勇",
    "english_name": "Tengen Uzui",
    "bbox_2d": [0, 54, 469, 1000]
  },
  {
    "name": "栗花落香奈乎",
    "english_name": "Kanae Katsura",
    "bbox_2d": [198, 347, 485, 1000]
  },
  {
    "name": "胡蝶しのぶ",
    "english_name": "Shinobu Kocho",
    "bbox_2d": [432, 323, 730, 1000]
  },
  {
    "name": "不死川玄弥",
    "english_name": "Genya Shinazugawa",
    "bbox_2d": [686, 446, 1000, 1000]
  },
  {
    "name": "不死川実弥",
    "english_name": "Muichiro Tokito",
    "bbox_2d": [217, 53, 507, 631]
  },
  {
    "name": "甘露寺蜜璃",
    "english_name": "Mitsuri Kanroji",
    "bbox_2d": [432, 86, 572, 378]
  },
  {
    "name": "伊之助",
    "english_name": "Inosuke Hashibira",
    "bbox_2d": [543, 0, 686, 335]
  },
  {
    "name": "時透無一郎",
    "english_name": "Muichiro Tokito",
    "bbox_2d": [643, 118, 798, 655]
  },
  {
    "name": "煉獄杏寿郎",
    "english_name": "Kyojuro Rengoku",
    "bbox_2d": [762, 35, 1000, 557]
  }
]
```

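It gets things wrong here and there (for example, 富岡義勇 is paired with "Tengen Uzui" even though that name belongs to a different character, and "Muichiro Tokito" appears twice), but it is clearly making a decent attempt.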
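
Some of these mistakes can be caught mechanically. As a minimal sketch, the check below flags any value that is assigned to more than one detection; `duplicated_values` is a hypothetical helper, and the name pairs are copied from the output above with the boxes omitted.

```python
from collections import Counter


def duplicated_values(items, key):
    """Return the values of `key` that occur in more than one item, in first-seen order."""
    counts = Counter(item[key] for item in items)
    return [value for value, n in counts.items() if n > 1]


detections = [
    {"name": "富岡義勇", "english_name": "Tengen Uzui"},
    {"name": "栗花落香奈乎", "english_name": "Kanae Katsura"},
    {"name": "胡蝶しのぶ", "english_name": "Shinobu Kocho"},
    {"name": "不死川玄弥", "english_name": "Genya Shinazugawa"},
    {"name": "不死川実弥", "english_name": "Muichiro Tokito"},
    {"name": "甘露寺蜜璃", "english_name": "Mitsuri Kanroji"},
    {"name": "伊之助", "english_name": "Inosuke Hashibira"},
    {"name": "時透無一郎", "english_name": "Muichiro Tokito"},
    {"name": "煉獄杏寿郎", "english_name": "Kyojuro Rengoku"},
]

print(duplicated_values(detections, "english_name"))  # ['Muichiro Tokito']
print(duplicated_values(detections, "name"))          # []
```

A check like this only catches internal inconsistencies; whether each label matches the character actually drawn still has to be verified by eye.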

kun432

Next up is this.
