Closed on 2025/01/28

Trying out SmolVLM, Hugging Face's ultra-lightweight VLM

https://huggingface.co/blog/smolervlm
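Two new models join the SmolVLM family: SmolVLM-256M and SmolVLM-500M. At 256M parameters, SmolVLM-256M is the smallest vision language model in the world!
We built on what we learned from SmolVLM 2B, focusing on efficiency, data mixtures, and new design trade-offs. We are excited to introduce two models that keep strong multimodal performance in a tiny footprint.
The release ships four checkpoints: two base models and two instruction fine-tuned models, at 256M and 500M parameters. The models can be loaded directly into transformers, MLX, and ONNX, and there are demos for transformers and for WebGPU (with ONNX). All models and demos for this release can be found via the blog post linked above.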


referred from https://huggingface.co/blog/smolervlm

Overview
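SmolVLM-256M: the world's smallest VLM!

SmolVLM-500M: a 500-million-parameter sibling that gives a significant performance bump while staying ultra-lightweight.

New vision encoder choice: we compared SigLIP 400M SO (used in SmolVLM 2B and many other large VLMs) against the smaller SigLIP base patch-16/512. Surprisingly, the bigger encoder gave only marginally better results, so these new releases use the 93M-parameter SigLIP base patch-16/512.

Higher image resolution: our smaller vision encoder processes images at a higher resolution (inspired by Apple's VLM research and Google's PaliGemma). This yields sharper image understanding with minimal overhead.

Training optimization: a new tokenization trick made the training loss look worse on paper, but it significantly boosted real-world benchmarks.
We are now reaching model parity with the SmolLM2 family (135M, 360M, 1.7B), so there is a complete set of smaller LLM + VLM combinations.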
Base models
https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Base
https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Base
Instruction-tuned models
https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct
https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct
The license is Apache-2.0.

SmolVLM-256M-Instruct

Trying it on a Colaboratory L4 instance.

Flash Attention is required, so install it. The build used to take so long that I relied on a wheel I had prebuilt myself, but now it installs normally.

!pip install flash-attn --no-build-isolation

!pip freeze | egrep -i "transformers|torch|flash-attn"

Output

flash-attn==2.7.3
sentence-transformers==3.3.1
torch @ https://download.pytorch.org/whl/cu121/torch-2.5.1%2Bcu121-cp311-cp311-linux_x86_64.whl
torchaudio @ https://download.pytorch.org/whl/cu121/torchaudio-2.5.1%2Bcu121-cp311-cp311-linux_x86_64.whl
torchsummary==1.5.1
torchvision @ https://download.pytorch.org/whl/cu121/torchvision-0.20.1%2Bcu121-cp311-cp311-linux_x86_64.whl
transformers==4.47.1
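
The load code in this scrap asks for Flash Attention 2 whenever CUDA is available, which fails if the `flash-attn` package did not install. Below is a minimal sketch of a guard for that case. `pick_attn_implementation` is a hypothetical helper name, not part of transformers, and falling back to `"eager"` simply mirrors the CPU branch used in this scrap.

```python
import importlib.util


def pick_attn_implementation(device: str) -> str:
    """Choose an attention backend name to pass when loading the model.

    Hypothetical helper: requests Flash Attention 2 only when running on
    CUDA *and* the flash_attn package is importable in this environment.
    """
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if device == "cuda" and has_flash_attn:
        return "flash_attention_2"
    # Same fallback the scrap uses for the CPU case.
    return "eager"


print(pick_attn_implementation("cpu"))
```

The returned string can be used in place of the inline conditional in the model-loading call.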

nvidia-smi before loading

!nvidia-smi

Output

Tue Jan 28 09:00:06 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   44C    P8              17W /  72W |      1MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Load the model and processor.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor and model
model_name = "HuggingFaceTB/SmolVLM-256M-Instruct"

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

nvidia-smi after loading

Output

Tue Jan 28 09:02:12 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   46C    P0              28W /  72W |    711MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

About 700 MB.
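
As a rough sanity check on that figure, the sketch below estimates how much of the increase the bfloat16 weights alone would explain. It assumes the nominal 256M parameter count from the model name (the real count may differ) and 2 bytes per parameter. The remainder is plausibly CUDA context and allocator overhead, though nothing was measured here to confirm that.

```python
MIB = 1024 ** 2


def weights_mib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the model weights in MiB (bfloat16 = 2 bytes)."""
    return num_params * bytes_per_param / MIB


nominal_params = 256e6   # assumption: nominal size taken from the model name
observed_mib = 711 - 1   # nvidia-smi after loading minus before loading

estimate = weights_mib(nominal_params)
print(f"bf16 weights (estimate): {estimate:.0f} MiB")
print(f"observed increase:       {observed_mib} MiB")
print(f"not explained by weights: {observed_mib - estimate:.0f} MiB")
```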

Now run inference as in the sample.

# Load the image
image = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")

# Build the input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe this image?"}
        ]
    },
]

# Prepare the inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate the output
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])

Output

User:



Can you describe this image?
Assistant: The image depicts a large, historic statue of liberty, located in New York City. The statue is a green, cylindrical structure with a human figure at the top, which is the actual statue of liberty. The statue is situated on a small pedestal, which is attached to a cylindrical structure that resembles the pedestal of the Statue of Liberty. The pedestal is supported by a series of columns and arches, which are typical of classical architecture.

In the background, there is a large body of water, which appears to be the Hudson River. The water is calm, with small ripples indicating that it is likely a sunny day. The sky is clear, with a gradient of colors from light blue at the top to a deeper blue at the bottom. The cityscape in the background includes a variety of buildings, including skyscrapers and high-rise buildings. These buildings are mostly made of glass and steel, and they are spread out across the cityscape.

The statue is surrounded by trees, which are mostly bare, indicating that it is winter or early spring. The trees are green, and their leaves are slightly turning brown, which is a common sight in the fall.

The statue is positioned on a small island, which is surrounded by water. The island is surrounded by a small, rocky embankment, which is typical of many island islands. The water around the island is calm, and there are no visible waves.

In the distance, there are more buildings and skyscrapers, which are part of the cityscape. The skyline is diverse, with multiple buildings and structures, including a few skyscrapers and high-rise buildings. The buildings are mostly made of glass and steel, and they are of various heights and designs.

The overall atmosphere of the image is one of peace and tranquility, as the statue is a symbol of freedom and the power of the people. The statue is a reminder of the historical and cultural significance of the Statue of Liberty, and it serves as a reminder of the importance of freedom and democracy.

### Analysis and Description

The Statue of Liberty is a significant symbol of freedom and democracy. It is a symbol of the United States of America and the people who have made it their home. The statue is a symbol of the power of the people, and it is a reminder of the importance of freedom and democracy.

### Conclusion

The image depicts a statue of liberty, located in New York City. The
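
The decoded text above echoes the prompt ("User: ... Assistant:") before the answer, and it stops mid-sentence, which is consistent with hitting `max_new_tokens=500`. Below is a minimal sketch for keeping only the reply. `extract_reply` is a hypothetical helper, and it assumes the "Assistant:" marker appears as it does in the output above.

```python
def extract_reply(decoded: str, marker: str = "Assistant:") -> str:
    """Return the text after the first marker, or the whole text if absent."""
    _, found, reply = decoded.partition(marker)
    return reply.strip() if found else decoded.strip()


# Shortened copy of the decoded output shown above.
sample = (
    "User:\n\n\n\nCan you describe this image?\n"
    "Assistant: The image depicts a large, historic statue of liberty."
)
print(extract_reply(sample))
```

In the scrap's code this would be applied to `generated_texts[0]`.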

nvidia-smi after inference

Output

Tue Jan 28 09:03:59 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   54C    P0              30W /  72W |   1325MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

About 1.3 GB.

For reference, the time inference took:

Output

CPU times: user 19.4 s, sys: 99.2 ms, total: 19.5 s
Wall time: 19.5 s
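
To put that wall time in perspective, here is a rough throughput estimate. It assumes generation ran to the full `max_new_tokens=500`, which the truncated output suggests but the log does not state. The wall time may also cover work other than token generation, so treat the result as a lower bound at best.

```python
def tokens_per_second(num_tokens: int, seconds: float) -> float:
    """Average generation rate over the whole call."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_tokens / seconds


assumed_new_tokens = 500   # assumption: generation hit max_new_tokens
wall_time_s = 19.5         # "Wall time" from the output above

rate = tokens_per_second(assumed_new_tokens, wall_time_s)
print(f"~{rate:.1f} tokens/s")
```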

SmolVLM-500M-Instruct

The procedure is the same, so it is omitted.

nvidia-smi after loading the model

Output

Tue Jan 28 09:09:52 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   49C    P0              22W /  72W |   1203MiB / 23034MiB |     27%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

About 1.2 GB.

Inference result

Output

User:



Can you describe this image?
Assistant: The image depicts a cityscape featuring a prominent landmark, the Statue of Liberty, prominently displayed in the foreground. The statue is situated on Liberty Island, which is a small, rocky island located in the heart of the Atlantic Ocean. The statue is characterized by its green hue and is situated on a pedestal, which is connected to the mainland by a small bridge.

In the background, the cityscape is filled with numerous high-rise buildings, indicating a bustling urban environment. The buildings vary in height and architectural style, with some featuring glass windows and others with more traditional designs. The sky above is clear, suggesting good weather conditions, and the sun is shining, casting a warm glow over the entire scene.

The water surrounding the island is calm, with a few small boats visible, indicating that the area is a popular spot for tourists and locals alike. The water is a deep blue, reflecting the sky and the buildings, creating a serene and picturesque setting.

The statue is a symbol of freedom and democracy, representing the ideals of liberty and equality. It is a significant cultural and historical artifact, often associated with the United States and its history. The statue is a UNESCO World Heritage site and is one of the most recognizable landmarks in the world.

In summary, the image captures a cityscape with the Statue of Liberty prominently displayed on Liberty Island, surrounded by a bustling urban environment with numerous high-rise buildings. The calm water and clear sky create a picturesque and serene atmosphere.

nvidia-smi after inference

Output

Tue Jan 28 09:10:21 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   55C    P0              36W /  72W |   1839MiB / 23034MiB |     16%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

About 1.8 GB.
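
The nvidia-smi readings for the two models can be compared side by side with a small sketch like the one below. The numbers are copied from the outputs in this scrap. The 1 MiB baseline was only recorded before the 256M run, so using it for the 500M run too is an assumption.

```python
# Memory-Usage values (MiB) copied from the nvidia-smi outputs in this scrap.
readings_mib = {
    "SmolVLM-256M-Instruct": {"after_load": 711, "after_inference": 1325},
    "SmolVLM-500M-Instruct": {"after_load": 1203, "after_inference": 1839},
}
BASELINE_MIB = 1  # recorded before the 256M load; assumed equal for 500M


def memory_deltas(reading: dict, baseline: int = BASELINE_MIB) -> dict:
    """Split usage into the part added by loading and by inference."""
    return {
        "load": reading["after_load"] - baseline,
        "inference": reading["after_inference"] - reading["after_load"],
    }


for name, reading in readings_mib.items():
    d = memory_deltas(reading)
    print(f"{name}: load +{d['load']} MiB, inference +{d['inference']} MiB")
```

By these readings the extra memory taken during inference is similar for both models, while the load-time footprint differs by roughly 500 MiB.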

Time taken for inference

Output

CPU times: user 13.6 s, sys: 375 ms, total: 14 s
Wall time: 15.2 s

MLX

M2 Pro Mac mini 32GB

pip install mlx-vlm

time python -m mlx_vlm.generate \
    --model HuggingFaceTB/SmolVLM-256M-Instruct \
    --max-tokens 400 \
    --temp 0.0 \
    --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vlm_example.jpg \
    --prompt "What is in this image?"

Output

==========
Image: ['https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vlm_example.jpg']

Prompt: <|im_start|>User:<image>What is in this image?<end_of_utterance>
Assistant:
 The image depicts a historical building, likely a temple or a similar structure, with a prominent golden architectural element. The building is adorned with intricate carvings and designs, which are typical of Buddhist architecture. The central structure is a tall, intricately designed structure with a series of smaller, more detailed structures, each contributing to the overall aesthetic of the building.

The building's roof is made of red tiles, and the roofline is adorned with intricate designs, including floral patterns and geometric patterns. The roofline is also adorned with gold accents, which are often used in Buddhist temples to symbolize the sacredness and divinity of the place.

The building is surrounded by a few other structures, including a smaller temple or pavilion, and a street lamp. The street lamp is positioned on the right side of the image, and it is a common feature in many Buddhist temples.

The sky is clear with a few clouds, and the overall lighting suggests that it is daytime. The image captures the building in a moment of stillness, with the focus on the intricate details of the temple's architecture.

To summarize, the image depicts a historical building with a prominent golden architectural element, surrounded by smaller structures, and a street lamp. The building is likely a temple or a similar structure, with intricate carvings and designs, and is surrounded by a red roof with gold accents. The sky is clear with a few clouds, and the overall lighting suggests it is daytime.<end_of_utterance>
==========
Prompt: 876 tokens, 1081.344 tokens-per-sec
Generation: 298 tokens, 238.899 tokens-per-sec
Peak memory: 1.604 GB

real	0m3.652s
user	0m2.210s
sys	0m1.503s

time python -m mlx_vlm.generate \
    --model HuggingFaceTB/SmolVLM-500M-Instruct \
    --max-tokens 400 \
    --temp 0.0 \
    --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vlm_example.jpg \
    --prompt "What is in this image?"

Output

==========
Image: ['https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vlm_example.jpg']

Prompt: <|im_start|>User:<image>What is in this image?<end_of_utterance>
Assistant:
 A tall pagoda with gold and blue decorations stands in front of a red and green building.<end_of_utterance>
==========
Prompt: 876 tokens, 969.314 tokens-per-sec
Generation: 21 tokens, 157.562 tokens-per-sec
Peak memory: 2.118 GB

real	0m2.598s
user	0m0.897s
sys	0m0.628s
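
The `real` times above cover the whole process, not just token processing. As a sketch, the per-phase time can be recovered from the "tokens, tokens-per-sec" lines that `mlx_vlm.generate` printed. The regular expression assumes exactly the line format shown in this scrap, and `phase_seconds` is a hypothetical helper.

```python
import re

# Matches e.g. "Prompt: 876 tokens, 1081.344 tokens-per-sec"
STAT_RE = re.compile(r"^(Prompt|Generation): (\d+) tokens, ([\d.]+) tokens-per-sec$")


def phase_seconds(log: str) -> dict[str, float]:
    """Return seconds spent per phase, computed as tokens / tokens-per-sec."""
    result = {}
    for line in log.splitlines():
        match = STAT_RE.match(line.strip())
        if match:
            phase, tokens, rate = match.group(1), int(match.group(2)), float(match.group(3))
            result[phase] = tokens / rate
    return result


log_256m = """Prompt: 876 tokens, 1081.344 tokens-per-sec
Generation: 298 tokens, 238.899 tokens-per-sec
Peak memory: 1.604 GB"""

log_500m = """Prompt: 876 tokens, 969.314 tokens-per-sec
Generation: 21 tokens, 157.562 tokens-per-sec
Peak memory: 2.118 GB"""

for name, log in [("256M", log_256m), ("500M", log_500m)]:
    t = phase_seconds(log)
    print(f"{name}: prompt {t['Prompt']:.2f} s, generation {t['Generation']:.2f} s")
```

Whatever remains of the `real` time after subtracting these two phases was spent elsewhere in the process. The log does not break that remainder down.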

Quantized versions are also available under mlx-community.

I hadn't realized SmolVLM2 had been released.

https://huggingface.co/blog/smolvlm2

I plan to try it in the following scrap.

This scrap was closed on 2025/01/28.