Trying out Gemma 3n

https://x.com/osanseviero/status/1938277414687121531
Official blog
https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/
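Building on this incredible momentum, we're excited to announce the full release of Gemma 3n. While last month's preview offered a glimpse, today unlocks the full power of this mobile-first architecture. Gemma 3n is designed for the developer community that helped shape Gemma. It's supported by your favorite tools including Hugging Face Transformers, llama.cpp, Google AI Edge, Ollama, MLX, and many others, enabling you to fine-tune and deploy for your specific on-device applications with ease. This post is the developer deep dive: we'll explore some of the innovations behind Gemma 3n, share new benchmark results, and show you how to start building today.
What's new in Gemma 3n?
Gemma 3n represents a major advancement for on-device AI, bringing powerful multimodal capabilities to edge devices with performance previously only seen in last year's cloud-based frontier models.
Multimodal by design: Gemma 3n natively supports image, audio, video, and text inputs and text outputs.
Optimized for on-device: Engineered with a focus on efficiency, Gemma 3n models are available in two sizes based on effective parameters: E2B and E4B. While their raw parameter count is 5B and 8B respectively, architectural innovations allow them to run with a memory footprint comparable to traditional 2B and 4B models, operating with as little as 2GB (E2B) and 3GB (E4B) of memory.
Groundbreaking architecture: At its core, Gemma 3n features novel components such as the MatFormer architecture for compute flexibility, Per Layer Embeddings (PLE) for memory efficiency, and new audio and MobileNet-v5 based vision encoders optimized for on-device use cases.
Enhanced quality: Gemma 3n delivers quality improvements across multilinguality (supporting 140 languages for text and multimodal understanding of 35 languages), math, coding, and reasoning. The E4B version achieves an LMArena score over 1300, making it the first model under 10 billion parameters to reach this benchmark.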

kun432

HuggingFace
https://huggingface.co/google/gemma-3n-E2B
https://huggingface.co/google/gemma-3n-E4B
https://huggingface.co/google/gemma-3n-E2B-it
https://huggingface.co/google/gemma-3n-E4B-it
The MLX builds target mlx-vlm, which suggests image input is supported as well.
https://huggingface.co/collections/mlx-community/gemma-3n-685d6c8d02d7486c7e77a7dc
llama.cpp supports text only so far.
https://github.com/ggml-org/llama.cpp/pull/14400
Ollama is probably text only as well?
https://ollama.com/library/gemma3n

kun432

mlx-vlm

Environment: M2 Mac, Python 3.11.10.

At the moment, installing with uv fails on package dependency resolution (mlx-audio is probably the culprit). Plain venv + pip worked.

mkdir mlx-vlm-work && cd $_
python -m venv .venv
source .venv/bin/activate

pip install mlx-vlm
pip freeze | grep -i mlx

Output

mlx==0.26.1
mlx-audio==0.2.3
mlx-lm==0.25.2
mlx-vlm==0.2.0

Try it with E2B-it.

python -m mlx_vlm.generate \
    --model mlx-community/gemma-3n-E2B-it-bf16 \
    --max-tokens 1024 \
    --temperature 0.0 \
    --prompt "競馬の魅力について5つリストアップして。"

The Japanese prompt does not seem to reach the model intact. I can't tell whether the model or mlx-vlm is at fault.

Output

==========
Files: []

Prompt: <bos><start_of_turn>user
ç«¶é¦¬ã®éåã«ã¤ãã¦5ã¤ãªã¹ãã¢ãããã¦ã<end_of_turn>
<start_of_turn>model

This appears to be a string of characters that are not easily decipherable as a standard language. It looks like a combination of:

* **Unicode characters:**  The characters themselves are from the Unicode character set.
* **Potentially corrupted or encoded data:** It's possible the string is corrupted, encoded in a specific format, or represents something that isn't meant to be read directly.
* **A specific character set or encoding:**  It might be part of a specific character set or encoding that isn't commonly used.

**Without more context, it's impossible to say what it means.**

**Here are some possibilities, but they are highly speculative:**

* **A code or identifier:** It could be a code used in a specific system or application.
* **A fragment of text:** It might be a small part of a larger text that has been cut off.
* **A random string of characters:** It could simply be a random sequence of characters.
* **A character encoding issue:**  The text might be displayed incorrectly due to a problem with the character encoding.

**To help me understand, could you provide more information? For example:**

* **Where did you find this string?** (e.g., a website, a file, a program)
* **What was the context in which you encountered it?**
* **What language do you think it might be in?**
* **Is there any other information associated with it?**



If you can provide more context, I might be able to offer a more helpful answer.




==========
Prompt: 71 tokens, 44.738 tokens-per-sec
Generation: 330 tokens, 24.597 tokens-per-sec
Peak memory: 12.123 GB
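
The garbled prompt echoed in the output above is consistent with UTF-8 bytes being decoded as Latin-1 somewhere before tokenization. The sketch below reproduces the same leading characters from the original prompt. It only illustrates the symptom; it does not identify where in the chain (shell, argument parsing, mlx-vlm) the mis-decoding happens.

```python
# Reproduce the mojibake: encode the prompt as UTF-8, then (wrongly) decode it as Latin-1.
original = "競馬の魅力について5つリストアップして。"
garbled = original.encode("utf-8").decode("latin-1")

# Drop non-printable characters (e.g. C1 control codes), which a terminal would not show.
visible = "".join(ch for ch in garbled if ch.isprintable())
print(visible)

# Reversing the mistake recovers the original text, so no information was lost,
# only misinterpreted.
restored = garbled.encode("latin-1").decode("utf-8")
print(restored)
```

If this is indeed what happens, the model never sees Japanese at all, which would explain why it answers as if it had been handed an undecipherable string.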

English works fine.

python -m mlx_vlm.generate \
    --model mlx-community/gemma-3n-E2B-it-bf16 \
    --max-tokens 1024 \
    --temperature 0.0 \
    --prompt "List five appealing aspects of horse racing."

Output

==========
Files: []

Prompt: <bos><start_of_turn>user
List five appealing aspects of horse racing.<end_of_turn>
<start_of_turn>model

Here are five appealing aspects of horse racing:

1. **The Thrill of the Race:**  Horse racing is inherently exciting! The speed, the power, the close finishes – it's a dynamic and unpredictable sport that keeps viewers on the edge of their seats.  The anticipation builds with each turn, and the final moments are often nail-biting.

2. **The Beauty of the Horses:**  Horses are magnificent animals, and their athleticism and grace are captivating to watch.  The sheer power and elegance of a well-trained racehorse are a visual treat.  The connection between horse and jockey is also a beautiful aspect.

3. **The History and Tradition:** Horse racing has a rich history, dating back centuries.  It's steeped in tradition, with established tracks, iconic races (like the Kentucky Derby or the Triple Crown), and a strong sense of heritage.  This historical element adds a layer of depth and fascination.

4. **The Social Aspect:**  Horse racing is a social event.  It's a great opportunity to gather with friends and family, enjoy good food and drinks, and share in the excitement of the day.  The atmosphere at a racetrack is often lively and festive.

5. **The Strategy and Skill:**  Beyond the speed, horse racing requires a great deal of strategy and skill.  Jockeys, trainers, and owners all make crucial decisions that can impact the outcome of a race.  Understanding the nuances of the sport – like pace, positioning, and horse form – is intellectually stimulating.




==========
Prompt: 17 tokens, 11.409 tokens-per-sec
Generation: 321 tokens, 24.641 tokens-per-sec
Peak memory: 11.999 GB

With mlx-lm and an ordinary text model, Japanese works without problems, so the issue presumably lies in mlx-vlm or in the model itself (mlx-lm does not support Gemma 3n).

python -m mlx_lm.generate \
    --model "google/gemma-2-2b-jpn-it" \
    --max-tokens 1024 \
    --prompt "競馬の魅力を5つリストアップして。"

Next, image input. I use the following image (a photo I took myself).

python -m mlx_vlm.generate \
    --model mlx-community/gemma-3n-E2B-it-bf16 \
    --max-tokens 1024 \
    --temperature 0.0 \
    --prompt "Describe this image." \
    --image kobe.jpg

Output

==========
Files: ['kobe.jpg']

Prompt: <bos><start_of_turn>user
<image_soft_token>Describe this image.<end_of_turn>
<start_of_turn>model

The image shows a vibrant cityscape situated along a waterfront. The sky is a clear, bright blue, taking up the majority of the frame.

The buildings are a mix of modern and somewhat whimsical designs. Several tall, white structures with distinctive red and white striped elements on top stand out. These buildings appear to be part of a complex, possibly a resort or entertainment area, given their unique architectural features.

The buildings are positioned right next to a body of water, likely a bay or harbor, which occupies the right side of the image. The water is a deep blue with gentle ripples, reflecting some of the buildings and the sky.

A pier or walkway extends out into the water, connecting the buildings to the shore. There are various structures and equipment on the pier, suggesting it's a functional area.

The overall impression is one of a sunny, pleasant day in a modern coastal city. The architecture is eye-catching and the scene is lively.
==========
Prompt: 272 tokens, 118.595 tokens-per-sec
Generation: 199 tokens, 23.747 tokens-per-sec
Peak memory: 13.060 GB

Hmm. I also tried Qwen2.5-VL, and it looks like mlx-vlm is unstable in several ways right now.

python -m mlx_vlm.generate \
    --model mlx-community/Qwen2.5-VL-3B-Instruct-4bit \
    --max-tokens 100 \
    --temp 0.0 \
    --prompt "Describe this image." \
    --image kobe.jpg

Output

AttributeError: module 'mlx_vlm.models.qwen2_5_vl' has no attribute 'AudioModel'. Did you mean: 'VisionModel'?

kun432

Transformers

On Google Colaboratory with an L4 GPU.

Install the packages. transformers>=4.53.0 is required, but the timm version pulled in as a dependency lags behind and causes an error; timm>=1.0.16 is needed. This will probably be fixed in the next release.

!pip install -U transformers
!pip install -U timm

!pip freeze | grep -Ei "transformers|timm"

Output

timm==1.0.16
transformers==4.53.0
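
The version requirements mentioned above (transformers>=4.53.0 and timm>=1.0.16) can also be checked from Python before loading the model. This is a minimal sketch: `parse_version`, `meets_minimum`, and `check_installed` are hypothetical helpers written for this note, and they only handle plain dotted numeric versions (no pre-release or dev suffixes).

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the note above.
REQUIRED = {"transformers": "4.53.0", "timm": "1.0.16"}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '4.53.0' into (4, 53, 0); assumes purely numeric components."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed: str, required: str) -> bool:
    """Compare versions numerically, so that 4.100.0 counts as newer than 4.53.0."""
    return parse_version(installed) >= parse_version(required)


def check_installed(requirements: dict[str, str]) -> dict[str, bool]:
    """Return, per package, whether the installed version satisfies the minimum."""
    results = {}
    for name, minimum in requirements.items():
        try:
            results[name] = meets_minimum(version(name), minimum)
        except (PackageNotFoundError, ValueError):
            # Not installed, or a version string this sketch cannot parse.
            results[name] = False
    return results


if __name__ == "__main__":
    print(check_installed(REQUIRED))
```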

First, with pipeline.

Load the model.

from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-e2b-it",
    device="cuda",
    torch_dtype=torch.bfloat16,
)

VRAM usage is about 10 GB. I thought the model was advertised as lightweight; is this expected?

Output

Fri Jun 27 02:29:57 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   63C    P0             31W /   72W |   10647MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
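
One likely explanation, offered as a rough estimate rather than a measurement: the blog post quoted earlier gives E2B a raw parameter count of about 5B, and the pipeline above loads the weights in bfloat16 (2 bytes per parameter). The sketch below does only that arithmetic. It ignores activations, the CUDA context, and any offloading or quantization that the 2 GB figure in the announcement may rely on.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / 1024**3


# ~5B raw parameters for E2B (figure from the blog post), bfloat16 = 2 bytes each.
e2b_bf16 = weight_memory_gib(5e9, 2)
print(f"E2B weights in bf16: about {e2b_bf16:.1f} GiB")

# nvidia-smi reported 10647 MiB in use after loading.
observed_gib = 10647 / 1024
print(f"Observed usage: about {observed_gib:.1f} GiB")
```

Under these assumptions, most of the observed usage would be the full-precision weights alone, so the number is not surprising for an unquantized load.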

The image provided with the sample code.

from IPython.display import Image

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"

display(Image(url))

The animal printed on it is... a turtle, maybe?

Now run inference. The output took about 2 minutes.

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "あなたは親切な日本語のアシスタントです。"}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url},
            {"type": "text", "text": "キャンディに描かれている動物は何？"}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=1024)
print(output[0]["generated_text"][-1]["content"])

Output

キャンディに描かれている動物は、**クジラ**です。

青いキャンディに、クジラのシルエットが描かれています。
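
Since the same message structure is reused for each image and question, it can be wrapped in a small helper. This is a sketch only: `build_messages` is a hypothetical function written for this note, not part of Transformers, and it assumes the image is reachable by URL, as in the example above.

```python
def build_messages(
    image_url: str,
    question: str,
    system_prompt: str = "あなたは親切な日本語のアシスタントです。",
) -> list[dict]:
    """Build a chat in the same shape as the `messages` list used above."""
    return [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        },
    ]


# Hypothetical usage with the `pipe` and `url` objects created earlier:
# output = pipe(text=build_messages(url, "この画像について説明して"), max_new_tokens=1024)
# print(output[0]["generated_text"][-1]["content"])
```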

Also with my own images.

Prompt: この画像について説明して ("Describe this image.")

はい、承知いたしました。この画像について説明しますね。

この画像は、青い空の下、海沿いに建ち並ぶ建物と、赤い色をした特徴的な建造物を捉えた風景です。

**主な要素:**

* **海:** 画像の大部分を占めており、穏やかな波が打ち寄せています。
* **建物:** 海岸沿いに複数の建物が並んでいます。白い建物や、一部に赤い屋根が見えます。これらの建物は、観光施設や商業施設など、様々な用途に使われている可能性があります。
* **赤い建造物:** 画像の中央に、赤と白の格子状の模様が特徴的な、非常に目を引く建造物があります。これは、東京タワーのようなランドマーク的な建造物である可能性があります。
* **空:** 晴れ渡った青空が広がっており、明るい日差しが降り注いでいます。

**全体的な印象:**

全体的に、明るく開放的な印象を与える画像です。海と建物のコントラストが美しく、観光地や港町の風景を思わせます。特に、赤い建造物は、その鮮やかな色と独特の形状で、強い印象を与えます。

何か特定の点についてもっと詳しく知りたいことがあれば、遠慮なくお尋ねください。例えば、建物の種類や赤い建造物が何であるかなど、わかる範囲でお答えします。

Prompt: この画像の場所はどこ？ ("Where was this image taken?")

この画像は、**大阪府大阪市港区にある海遊館** の風景です。

画像中央にそびえ立つ赤いタワーは、海遊館のシンボルである**「海遊館タワー」** ですね。背景には海遊館の建物や、周囲の景色が見えます。

Prompt: この書籍の表紙に書いてあることを教えて。 ("Tell me what is written on the cover of this book.")

この書籍の表紙には、以下の内容が書かれています。

**タイトル:** ヒューマニティン・ビジネス　体質改革から成果へ

**副題:** 入門するには多そう。ふたりで読めると。

**著者:** 著: 齋藤 孝、監修: 尾野 泉、訳: 森山 健彦、瀬戸 昭明

**概要:**

* ヒューマニティン・ビジネスという言葉が書かれています。
* 「体質改革から成果へ」という目標が示されています。
* 入門には少し難しいと感じるかもしれませんが、二人で読むと理解しやすいということが書かれています。

全体的に、人間性を重視したビジネスのあり方について、初心者向けに解説する書籍であることがわかります。

Prompt: この画像の請求書の内容について詳しく教えて。 ("Tell me in detail about the contents of the invoice in this image.")

はい、承知いたしました。この画像は、**春来稲荷**という会社からの請求書です。以下に詳細を説明します。

**請求書発行元:**

*   **会社名:** 春来稲荷 (しゅんらいいなり)
*   **住所:** 0580-305-V11. 月春来稲荷 (詳細な住所は記載されていません)
*   **電話番号:** 0580-305-V11

**請求内容:**

| 項目     | 単価   | 数量 | 金額   | 日付       |
| -------- | ------ | ---- | ------ | ---------- |
| 運送     | ¥20   | 1000 | ¥20,000 | (木・土A) 通常運送 (整数) |
| 運送     | ¥30   | 30   | ¥9,000  | (番台) イ *=産 |
| 運送     | ¥20   | 20   | ¥4,000  |            |
| 小計     |        |      | ¥33,000 |            |
| 合計     |        |      | **¥33,000** |            |

**請求金額:**

*   **合計金額:** ¥33,000

**その他:**

*   請求書には、**「金金金」**という表記があり、おそらく請求書番号を示していると考えられます。
*   **「発注日」**、**「納品日」**、**「備考」**などの項目は記載されていません。
*   **「合計金額：¥33,000」**と記載されています。
*   請求書の下部には、**「ご請求金額：¥33,000」**と明記されています。
*   **「※備考：ご請求書はお送り次第、ご査収ください。お急ぎの場合はお電話でお問い合わせください。」**と記載されており、請求書が送付され次第、確認するよう促しています。
*   **「※その他：基本運送費を頂戴します。」**と記載されています。

**まとめ:**

この請求書は、春来稲荷という会社から、運送サービスに関する請求書です。運送の単価は¥20、¥30、¥20で、合計金額は¥33,000です。

ご不明な点がございましたら、お気軽にお尋ねください。

It does not seem suited to this kind of use.

kun432

Load the model using Gemma3nForConditionalGeneration.

The image used in the sample code looks like this.

from IPython.display import Image

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
display(Image(url))

from transformers import AutoProcessor, Gemma3nForConditionalGeneration
import torch

model_id = "google/gemma-3n-e2b-it"

model = Gemma3nForConditionalGeneration.from_pretrained(
    model_id,
    device_map="cuda",  # from_pretrained() rejects device=...; without device_map the model stays on the CPU
    torch_dtype=torch.bfloat16,
).eval()

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "あなたは親切な日本語のアシスタントです。"}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": url},
            {"type": "text", "text": "この画像について説明して。"}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)

Output

はい、承知いたしました。この画像について説明しますね。

画像には、鮮やかなピンク色の花が写っています。花びらは6枚で、中心には黄色い小さな花びらが集まっています。花は、緑色の葉の間に咲いており、背景はぼけていますが、緑や茶色など様々な色が見えます。

花の中には、黒と白の縞模様の美しいハチがいます。ハチは花に止まって、蜜を吸っているようです。

全体的に、夏の庭や花壇の様子を捉えた、とても生き生きとした写真ですね。ハチと花が一緒にいる様子は、自然の美しさを感じさせます。

kun432

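Summary

For now, Japanese seems to work fine. As for image recognition, text inside images looks difficult, but objects and scenery seem to be recognized reasonably well. Audio is supposed to work too, right?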
https://x.com/webbigdata/status/1938289226879259057
https://x.com/webbigdata/status/1938289229408387559
What remains is an environment where it runs easily. mlx-vlm seems unstable in several ways, so I would rather wait for llama.cpp support to mature.

kun432

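It's supported by your favorite tools including Hugging Face Transformers, llama.cpp, Google AI Edge, Ollama, MLX, and many others,

It feels like every project scrambled to be ready for the release. Perhaps that shows how important it is becoming for a model to run across many different frameworks.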
