Closed on 2025/05/20

Trying out SmolVLM2, an ultra-lightweight VLM that also supports video

I tried SmolVLM previously:
https://zenn.dev/kun432/scraps/40a2880f672c22
I had no idea that SmolVLM2 had already been released:
https://huggingface.co/blog/smolvlm2
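We are announcing three new models with 256M, 500M, and 2.2B parameters. The 2.2B model is the best choice for vision and video tasks, while the 500M and 256M models are the smallest video language models released so far.
Despite their small size, they outperform every existing model per unit of memory consumed. On Video-MME (a scientific benchmark for video), SmolVLM2 joins the frontier model families in the 2B range, and we lead in the even smaller range.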


referred from https://huggingface.co/blog/smolvlm2
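Video-MME stands out as a comprehensive benchmark because of its broad coverage: diverse video types, varying durations (11 seconds to 1 hour), multiple data modalities (including subtitles and audio), and high-quality expert annotations over 900 videos totaling 254 hours.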

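SmolVLM2 2.2B: the new star for vision and video

Compared with the previous SmolVLM family, the new 2.2B model has become better at solving math problems with images, reading text in photos, understanding complex diagrams, and tackling scientific visual questions. This shows in the model's performance across various benchmarks: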


referred from https://huggingface.co/blog/smolvlm2
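When it comes to video tasks, 2.2B offers good value for its size. Across the various scientific benchmarks we evaluated, its performance on Video-MME exceeded that of existing 2B models.
Thanks to the data mixture learnings published in Apollo: An Exploration of Video Understanding in Large Multimodal Models, we were able to strike a good balance between video and image performance.
It is so memory-efficient that it can even run on a free Google Colab.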

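Even smaller: the 500M and 256M video models

Until today, nobody dared to release video models this small.
Our new SmolVLM2-500M-Video-Instruct model has video capabilities very close to SmolVLM 2.2B, at a fraction of the size.
And then there is a small experiment, SmolVLM2-256M-Video-Instruct. Think of it as our "what if" project: what if we could push the limits of small models even further? Taking a cue from what IBM achieved with our base SmolVLM-256M-Instruct a few weeks ago, we wanted to see how far we could go with video understanding. This is an experimental release, but we hope it will inspire creative applications and specialized fine-tuning projects.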

The models. Note that only the 256M one is labeled experimental.
https://huggingface.co/HuggingFaceTB/SmolVLM2-256M-Video-Instruct
https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct
https://huggingface.co/HuggingFaceTB/SmolVLM2-2.2B-Instruct
MLX versions are also available:
https://huggingface.co/mlx-community/SmolVLM2-256M-Video-Instruct-mlx
https://huggingface.co/mlx-community/SmolVLM2-500M-Video-Instruct-mlx
https://huggingface.co/mlx-community/SmolVLM2-2.2B-Instruct-mlx

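As the model card says, and just like the previous SmolVLM, it is basically English-only and does not appear to support Japanese.

Although it is experimental, I'll try the lightest model, 256M. FlashAttention is needed, so I'm using an L4 GPU on Colaboratory.

Install the dependency packages listed in the model card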

!pip install num2words
!pip install flash-attn --no-build-isolation

Load the model

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_path = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2"
).to("cuda")
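
The load step above hard-codes FlashAttention 2. As a minimal sketch (my assumption, not something the model card prescribes), the attention backend could instead be picked depending on whether the `flash_attn` package is importable, falling back to PyTorch's `sdpa` otherwise. `choose_attn_implementation` is a hypothetical helper name.

```python
import importlib.util


def choose_attn_implementation(has_cuda: bool) -> str:
    """Return "flash_attention_2" only when it is likely to work, else "sdpa"."""
    # flash-attn only runs on CUDA GPUs, so both conditions must hold.
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    if has_cuda and flash_installed:
        return "flash_attention_2"
    return "sdpa"


# Without a GPU the helper never selects FlashAttention.
print(choose_attn_implementation(has_cuda=False))
```

The result would presumably be passed as `attn_implementation=` to `from_pretrained`, with `torch.cuda.is_available()` as the argument; I have not verified how the fallback behaves with this model.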

VRAM usage is very light at about 700 MB.

Output

Mon May 19 18:55:00 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   47C    P0             18W /   72W |     709MiB /  23034MiB |      7%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
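
To compare VRAM figures between runs without reading the table by eye, the memory field can be pulled out with a regular expression. This is only a sketch; `parse_gpu_memory` is a hypothetical helper, and it assumes the `NNNMiB / NNNMiB` layout shown in the output above.

```python
import re


def parse_gpu_memory(smi_text: str) -> tuple[int, int]:
    """Return (used_mib, total_mib) from the first 'NNNMiB / NNNMiB' field."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", smi_text)
    if match is None:
        raise ValueError("no memory usage field found")
    return int(match.group(1)), int(match.group(2))


# Row copied from the nvidia-smi output above.
row = "| N/A   47C    P0             18W /   72W |     709MiB /  23034MiB |      7%      Default |"
used, total = parse_gpu_memory(row)
print(f"{used} MiB used of {total} MiB ({used / total:.1%})")
```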

The image used in the sample code is the one below. Let's run inference on it.

from IPython.display import HTML

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
HTML(f'<img src="{url}" width="600" style="border-radius:8px;">')

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url},
            {"type": "text", "text": "Can you describe this image briefly?"},            
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=512)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])

Output

User:



Can you describe this image briefly?
Assistant: In the image, there is a flower with a bee on it. The flower is pink in color and has a yellow center. The bee is black and yellow. The flower is surrounded by green leaves. The background is blurred, but it looks like a garden or a field
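
The decoded text above contains the prompt as well as the answer. A small sketch for keeping only the reply, assuming the chat template always prefixes it with `Assistant:` as in this output (`extract_reply` is a hypothetical helper):

```python
def extract_reply(decoded: str, marker: str = "Assistant:") -> str:
    """Return the text after the last 'Assistant:' marker, stripped."""
    _, found, reply = decoded.rpartition(marker)
    # If the marker is missing, fall back to the whole string.
    return reply.strip() if found else decoded.strip()


sample = "User:\n\n\n\nCan you describe this image briefly?\nAssistant: In the image, there is a flower with a bee on it."
print(extract_reply(sample))
```

Another common approach is to decode only the newly generated tokens, i.e. `generated_ids[:, inputs["input_ids"].shape[1]:]`, which avoids relying on the marker text.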

Let's also try it with the following photo of Kobe that I have on hand.

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://storage.googleapis.com/zenn-user-upload/82968d23b6c5-20250228.jpg"},
            {"type": "text", "text": "Can you describe this image briefly?"},            
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=512)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])

Output

User:



Can you describe this image briefly?
Assistant: The image depicts a view of a city with a prominent waterfront. The waterfront is lined with buildings, including a tall tower, which is a prominent feature of the city. The buildings are of various heights, with some having balconies and others having windows. The sky is clear and blue, suggesting a sunny day. The water is calm, reflecting the sky and the buildings on the waterfront.

Let's change the question a bit.

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://storage.googleapis.com/zenn-user-upload/82968d23b6c5-20250228.jpg"},
            {"type": "text", "text": "Which country is shown in this image?Answer briefly."},            
        ]
    },
]

Output

User:



Which country is shown in this image?Answer briefly.
Assistant: The image shows the city of Mumbai.

Wrong (the photo shows Kobe, Japan), but at 256M that is probably to be expected.

Now let's ask for a detailed description.

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://storage.googleapis.com/zenn-user-upload/82968d23b6c5-20250228.jpg"},
            {"type": "text", "text": "List the items shown in this image and explain them in detail."},            
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=2048)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])

Output

User:



List the items shown in this image and explain them in detail.
Assistant: The image depicts a view of a cityscape featuring a prominent red tower and a modern building. The tower, which is the focal point of the image, is situated on the left side of the frame, while the modern building is on the right. Both structures are situated on a body of water, with the water's surface reflecting the sky and the buildings. The sky is clear, and the overall lighting is bright and sunny, suggesting a sunny day.

Responses generally come back within about 5 seconds.

After running inference several times, VRAM usage is around 1.2 GB.

Output

Mon May 19 19:08:02 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   76C    P0             35W /   72W |    1289MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Video inference. decord is required.

!pip install decord

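I'll borrow the sample video I used when I tried Qwen2.5-Omni earlier.

The sample code there uses the following video as input (shown here as a still image).

The audio in the video goes like this: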

Hello, take a look at what I am drawing.

Download the video

!wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4

Inference

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "draw.mp4"},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=1024)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])

Error. It seems PyAV is also required.

Output

ImportError: You chose backend=pyav for loading the video but the required library is not found in your environment Make sure to install pyav before loading the video.

Install it

!pip install av

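Running inference again right away doesn't seem to pick up the newly installed module in the running runtime, so I restarted the runtime.

After the restart, loading the model and running the video inference again worked without problems. Inference took roughly 15 seconds.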
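
To avoid discovering a missing video backend only after the model is loaded, the imports can be checked up front. A minimal sketch; `missing_modules` is a hypothetical helper, and `av` is the import name of the PyAV package installed above.

```python
import importlib.util


def missing_modules(names: list[str]) -> list[str]:
    """Return the subset of module names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "av" is the import name of PyAV; "decord" was installed earlier.
missing = missing_modules(["av", "decord"])
if missing:
    print("pip install " + " ".join(missing) + "  # then restart the runtime")
else:
    print("video backends available")
```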

Output

User: You are provided the following series of twenty-one frames from a 0:00:20 [H:MM:SS] video.

Frame from 00:00:
Frame from 00:01:
Frame from 00:02:
Frame from 00:03:
Frame from 00:04:
Frame from 00:05:
Frame from 00:06:
Frame from 00:07:
Frame from 00:08:
Frame from 00:09:
Frame from 00:10:
Frame from 00:11:
Frame from 00:12:
Frame from 00:13:
Frame from 00:14:
Frame from 00:15:
Frame from 00:16:
Frame from 00:17:
Frame from 00:18:
Frame from 00:19:
Frame from 00:20:

Describe this video in detail
Assistant: The video showcases a hands-on tutorial on how to create a guitar-shaped outline on a tablet. Initially, the tablet displays a blank white screen with a simple outline of a guitar, featuring a curved neck and a pointed headstock. The user interacts with the tablet by drawing a line on the screen, which then forms a guitar shape. The process is repeated, with the user drawing a line on the screen to create a guitar outline. The video then transitions to a close-up of the tablet's screen, showing the outline of the guitar in detail. The user continues to draw the outline, ensuring the lines are precise and the shapes are consistent. The video concludes with the tablet displaying the outline of the guitar, with the user continuing to draw the outline, maintaining the consistent and detailed process.

It looks like it samples one frame per second and runs image recognition on those frames.
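
As an illustration of what the prompt above suggests (one frame per second, both endpoints included), the timestamps can be reproduced like this. This is my reading of the output, not the processor's actual sampling logic, and `frame_timestamps` is a hypothetical helper.

```python
def frame_timestamps(duration_s: int, fps: float = 1.0) -> list[str]:
    """List MM:SS labels for frames sampled at `fps`, including both endpoints."""
    count = int(duration_s * fps) + 1
    labels = []
    for i in range(count):
        seconds = int(i / fps)
        labels.append(f"{seconds // 60:02d}:{seconds % 60:02d}")
    return labels


stamps = frame_timestamps(20)
print(len(stamps), stamps[0], stamps[-1])
```

For the 20-second video this gives 21 labels from 00:00 to 00:20, matching the "twenty-one frames" in the prompt.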

Does it recognize the audio?

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "draw.mp4"},
            {"type": "text", "text": "Transcribe the audio in the beginning of this video."}
        ]
    },
]

A person is using a stylus to draw on a tablet.

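Hmm, even though the input is a video, it doesn't appear to recognize the audio.

Let's try a slightly unusual example. I converted the following video to mp4 and fed it to the model.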

https://youtu.be/hPVo482x7MA

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "sample.mp4"},
            {"type": "text", "text": "Describe this video"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=1024)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])

The video showcases a computer screen displaying a list of options for a user to interact with, starting with a list of options labeled 'vaginemaster' and '5.' The user is prompted to select 'vaginemaster' and then '5,' indicating a preference for the 'vaginemaster' option. The list includes options such as 'READY,' 'UP TO-DATE,' and 'AVAILABLE.' The user is then prompted to select 'vaginemaster' again, and the list continues with 'vaginemaster' and '5,' indicating a preference for the 'vaginemaster' option. The user is then prompted to select 'vaginemaster' again, and the list includes options such as 'READY,' 'UP TO-DATE,' and 'AVAILABLE.' [...the preceding sentence repeats verbatim many more times...] The user is then prompted to select 'vaginemaster' again, and the list includes options such as 'READY,'

The answer degenerated into repetition, but it does seem to have read at least some of the on-screen text.
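
A loop like this one is easy to detect after the fact. The following is a naive sketch (`most_repeated_sentence` is a hypothetical helper, and the sample string is made up for illustration, not taken from the model output):

```python
from collections import Counter


def most_repeated_sentence(text: str) -> tuple[str, int]:
    """Return the most frequent sentence and how many times it occurs."""
    # Naive split on ". "; good enough to spot a generation loop.
    sentences = [s.strip().rstrip(".") for s in text.split(". ") if s.strip()]
    if not sentences:
        return "", 0
    return Counter(sentences).most_common(1)[0]


looped = "A list appears. The user selects an option. The user selects an option. The user selects an option."
sentence, count = most_repeated_sentence(looped)
print(count, repr(sentence))
```

`generate` also accepts `repetition_penalty` and `no_repeat_ngram_size`, which might reduce this kind of looping; I did not test whether they help here.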

You can also pass multiple images.

The sample images look like this.

from IPython.display import HTML, display

url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
display(HTML(
    f'<img src="{url1}" width="500" style="border-radius:8px;">'
    f'<img src="{url2}" width="500" style="border-radius:8px;">'
))

messages = [
    {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is the similarity between these two images?"},
          {"type": "image", "url": url1},
          {"type": "image", "url": url2},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=1024)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])

Output

User: What is the similarity between these two images?









Assistant: The image captures a scene from a countryside setting, featuring a rabbit dressed in a blue coat and a blue suit, standing in a rural setting. The background features a quaint village with a stone cottage and a lush green field, adding to the idyllic atmosphere. The rabbit's position and the surrounding environment suggest a relaxed and leisurely pace, with the cottage and the field providing a sense of tranquility.

Hmm, this is not great. It looks like it only looked at the second image.

Let's swap the order a bit.

messages = [
    {
        "role": "user",
        "content": [
          {"type": "image", "url": url1},
          {"type": "image", "url": url2},
          {"type": "text", "text": "What is the similarity between these two images?"},
        ]
    },
]

Output

User:








What is the similarity between these two images?
Assistant: Both images have a rabbit in them, but the first image has a more detailed and colorful flower in the background, while the second image has a more rustic and rural setting.

It still doesn't really seem to see both images properly (there is no rabbit in the first one), but it's better than before.
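
Since putting the images before the text gave the better answer in this one comparison, a small helper can enforce that ordering. A sketch only: `build_messages` is a hypothetical name, the URLs below are placeholders, and a single trial is not evidence that this ordering is generally better.

```python
def build_messages(prompt: str, image_urls: list[str]) -> list[dict]:
    """Build a single-turn chat message with all images placed before the text."""
    content = [{"type": "image", "url": url} for url in image_urls]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_messages(
    "What is the similarity between these two images?",
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],  # placeholder URLs
)
print([item["type"] for item in messages[0]["content"]])
```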

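The other day I tried llama.cpp's multimodal support.

Not only SmolVLM but also SmolVLM2 is among the supported models. I have it installed on my local Mac (M2 Pro), so let's try it there, using the Kobe photo from above.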

256M

llama-mtmd-cli \
    -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF \
    --image kobe.jpg \
    -p "Describe this image."

The image captures a serene view of a city's waterfront, dominated by a prominent red tower that stands out against the backdrop of the cityscape. The tower, with its distinctive red color, is a focal point of the image, and it's surrounded by a series of buildings that line the waterfront, their facades a mix of modern and historic architecture. The water, a deep blue color, reflects the sky above, adding to the overall tranquility of the scene. The perspective of the image is from the water, looking out over the city, and it's taken from a low angle, giving the viewer a sense of the height and grandeur of the city.

Output

llama_perf_context_print:        load time =     434.83 ms
llama_perf_context_print: prompt eval time =     231.04 ms /    80 tokens (    2.89 ms per token,   346.26 tokens per second)
llama_perf_context_print:        eval time =     483.04 ms /   134 runs   (    3.60 ms per token,   277.41 tokens per second)
llama_perf_context_print:       total time =     841.76 ms /   214 tokens

500M

llama-mtmd-cli \
    -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF \
    --image kobe.jpg \
    -p "Describe this image."

The image captures the iconic Skydome, a renowned sports arena located in the heart of Sydney, Australia. The skyline of the city is visible in the background, with a mix of modern and traditional architecture. The sky is a clear blue, suggesting a sunny day.

The Skydome is a large, rectangular structure with a distinctive red and white color scheme. It is situated on the waterfront, with a long pier extending out into the ocean. The pier is lined with shops and restaurants, adding to the lively atmosphere of the city.

The image is taken from a high vantage point, providing a bird's eye view of the Skydome and the surrounding area. The perspective allows for a comprehensive view of the city's skyline and the surrounding waterfront.

Overall, the image presents a vibrant and dynamic cityscape, with a focus on the iconic Skydome and its surroundings.

Output

llama_perf_context_print:        load time =     787.60 ms
llama_perf_context_print: prompt eval time =     215.10 ms /    80 tokens (    2.69 ms per token,   371.92 tokens per second)
llama_perf_context_print:        eval time =    1077.84 ms /   186 runs   (    5.79 ms per token,   172.57 tokens per second)
llama_perf_context_print:       total time =    1422.05 ms /   266 tokens

2.2B

llama-mtmd-cli \
    -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF \
    --image kobe.jpg \
    -p "Describe this image."

The image captures the serene beauty of a harbor, where the calm waters reflect the towering structures of a city skyline. Dominating the center of the frame is a striking red tower, its vibrant hue contrasting with the clear blue sky above. This tower, along with several other buildings, forms a harmonious blend of architectural styles, adding a touch of modernity to the scene.

The harbor itself is a tranquil expanse of water, its surface undisturbed by any visible activity. The buildings on the shore, including a white building with a distinctive curved roof, add a sense of depth to the image. The water's edge is marked by a sturdy pier, providing a vantage point for visitors to admire the city's skyline.

The sky above is a clear blue, devoid of any clouds, suggesting a calm and pleasant day. The horizon is visible in the distance, where the city skyline meets the sky, adding a sense of depth and scale to the image. The overall composition of the image, with the water, buildings, and sky, creates a harmonious balance, capturing the essence of a city by the sea.

Output

llama_perf_context_print:        load time =     232.51 ms
llama_perf_context_print: prompt eval time =    1452.06 ms /    97 tokens (   14.97 ms per token,    66.80 tokens per second)
llama_perf_context_print:        eval time =    2417.57 ms /   225 runs   (   10.74 ms per token,    93.07 tokens per second)
llama_perf_context_print:       total time =    4122.21 ms /   322 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating
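
To compare the three runs side by side, the `llama_perf_context_print` lines can be parsed into numbers. This is a sketch that assumes the exact line layout shown in the logs above; `parse_perf` is a hypothetical helper.

```python
import re

PERF_LINE = re.compile(
    r"^llama_perf_context_print:\s+(prompt eval|eval) time =\s*([\d.]+) ms /\s*(\d+) "
    r"(?:tokens|runs)\s*\(.*?([\d.]+) tokens per second\)",
    re.MULTILINE,
)


def parse_perf(log: str) -> dict:
    """Map 'prompt eval' / 'eval' to elapsed ms, token count, and tokens/s."""
    stats = {}
    for phase, ms, count, tps in PERF_LINE.findall(log):
        stats[phase] = {"ms": float(ms), "tokens": int(count), "tokens_per_s": float(tps)}
    return stats


# Two lines copied from the 256M run above.
log_256m = """\
llama_perf_context_print: prompt eval time =     231.04 ms /    80 tokens (    2.89 ms per token,   346.26 tokens per second)
llama_perf_context_print:        eval time =     483.04 ms /   134 runs   (    3.60 ms per token,   277.41 tokens per second)
"""
print(parse_perf(log_256m)["eval"]["tokens_per_s"])
```

In the logs above, generation speed drops from 277.41 tokens per second for 256M to 172.57 for 500M and 93.07 for 2.2B.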

This makes you want to try something like the following:
https://x.com/ngxson/status/1921980096421806127
https://github.com/ngxson/smolvlm-realtime-webcam
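So I tried it on the Mac with the 256M model, and it was kind of fun to watch inference run continuously in near real time. However, even at an interval of 500 ms it felt like processing wasn't keeping up with the requests, so a 1 s interval seems better.
I also tried it on an RPi 4, where the CPU was pegged at 100% (on all 4 cores) even at an interval of 2 s. It does run at a 2-second interval, but processing doesn't quite seem to keep up.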
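
Why a too-short interval feels laggy can be sketched with a tiny queue model. The assumptions here are mine: requests are processed one at a time and none are dropped, and the 0.8 s latency is a made-up number, not something measured in this scrap. `response_lag` is a hypothetical helper.

```python
def response_lag(n_requests: int, interval_s: float, latency_s: float) -> float:
    """Seconds between sending the last request and receiving its answer."""
    finished_at = 0.0
    sent_at = 0.0
    for i in range(n_requests):
        sent_at = i * interval_s
        # A request starts only once the previous one has finished.
        finished_at = max(sent_at, finished_at) + latency_s
    return finished_at - sent_at


# Hypothetical latency of 0.8 s per frame (not measured in this scrap).
print(response_lag(10, interval_s=0.5, latency_s=0.8))  # lag keeps growing
print(response_lag(10, interval_s=1.0, latency_s=0.8))  # lag stays at one latency
```

Under these assumptions the lag grows with every request whenever the interval is shorter than the per-frame latency, which fits the impression that a longer interval works better.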

Summary

Personally I can't really find a good use case for video, but it's nice that the model is lightweight and easy to use.
