Trying out "Intern-S1-mini"

https://x.com/intern_lm/status/1958479430361461008
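🔥Introducing Intern-S1-mini! A lightweight open-source multimodal reasoning model built on the same techniques as Intern-S1.

🥳With only 8 billion parameters, it is optimized for fast deployment and easy customization.
It excels in specialized scientific domains while retaining strong general capabilities.
It is built on an 8B dense language model and a 0.3B vision encoder.
It is a capable research assistant for real-world scientific applications.

🤗Model: @huggingface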

https://huggingface.co/internlm/Intern-S1-mini

🤗GitHub:

https://github.com/InternLM/Intern-S1

🤗Try it now:

https://chat.intern-ai.org.cn
A little while ago a reasoning-capable VLM called Intern-S1 was released, but its hardware requirements were too high for me to try it.
https://x.com/intern_lm/status/1949058299498016907
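🚀Introducing Intern-S1, our most advanced open-source multimodal reasoning model!

🥳It combines strong performance on general tasks with state-of-the-art (SOTA) performance on scientific tasks, on par with leading closed-source commercial models.

🥰Built on a 235B MoE language model and a 6B vision encoder.

🥰Pretrained on 5T tokens (more than 50% of which is scientific data).

🥳A dynamic tokenizer enables native understanding of molecular formulas, protein sequences, and seismic signals.

🤗Model: @huggingface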

https://huggingface.co/internlm/Intern-S1-FP8

🤗GitHub:

https://github.com/InternLM/Intern-S1

🤗Try it now:

https://chat.intern-ai.org.cn
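I have tried the Intern series on and off in the past and liked that the models are lightweight and handle Japanese reasonably well, so I was disappointed that S1 alone was so large. Now that a reasonably sized S1-mini is out, I have high hopes.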
https://zenn.dev/kun432/scraps/43367d8785f600
https://zenn.dev/kun432/scraps/df75970a343d09

kun432

Model
https://huggingface.co/internlm/Intern-S1-mini

 Intern-S1-mini

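 Introduction

We introduce Intern-S1-mini, a lightweight open-source multimodal reasoning model based on the same techniques as Intern-S1. Intern-S1-mini is built on an 8B dense language model (Qwen3) and a 0.3B vision encoder (InternViT), and has been further pretrained on 5 trillion tokens of multimodal data, including more than 2.5 trillion tokens from scientific domains. This lets the model retain its general capabilities while excelling in specialized scientific domains such as interpreting chemical structures, understanding protein sequences, and planning compound synthesis routes. Intern-S1-mini is a capable research assistant for real-world scientific applications.

 Features

Strong performance across language and vision reasoning benchmarks, especially on scientific tasks.
Continually pretrained on a massive 5T-token dataset, more than 50% of which is specialized scientific data, embedding deep domain knowledge.
A dynamic tokenizer enables native understanding of molecular formulas and protein sequences.

 Performance

We evaluated Intern-S1-mini on a variety of benchmarks, including general and scientific datasets. The comparison with recent VLMs and LLMs is shown below.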


Category	Benchmark	Intern-S1-mini	Qwen3-8B	GLM-4.1V	MiMo-VL-7B-RL-2508
General	MMLU-Pro	74.78	73.7	57.1	73.93
	MMMU	72.33	N/A	69.9	70.4
	MMStar	65.2	N/A	71.5	72.9
	GPQA	65.15	62	50.32	60.35
	AIME2024	84.58	76	36.2	72.6
	AIME2025	80	67.3	32	64.4
	MathVision	51.41	N/A	53.9	54.5
	MathVista	70.3	N/A	80.7	79.4
	IFEval	81.15	85	71.53	71.4
Science	SFE	35.84	N/A	43.2	43.9
	Physics	28.76	N/A	28.3	28.2
	SmolInstruct	32.2	17.6	18.1	16.11
	ChemBench	76.47	61.1	56.2	66.78
	MatBench	61.55	45.24	54.3	46.9
	MicroVQA	56.62	N/A	50.2	50.96
	ProteinLMBench	58.47	59.1	58.3	59.8
	MSEarthMCQ	58.12	N/A	50.3	47.3
	XLRS-Bench	51.63	N/A	49.8	12.29
All models were evaluated using OpenCompass and VLMEvalKit.
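To get a quick feel for the table, here is a small sketch that counts the benchmarks where Intern-S1-mini scores strictly higher than every rival that reports a number. The scores are copied from the table above. The "leads" criterion and the helper name `count_leads` are my own choices for illustration, not something the model card defines.

```python
# Scores copied from the benchmark table; None marks an N/A entry.
# Column order: Intern-S1-mini, Qwen3-8B, GLM-4.1V, MiMo-VL-7B-RL-2508
SCORES = {
    "General": {
        "MMLU-Pro": (74.78, 73.7, 57.1, 73.93),
        "MMMU": (72.33, None, 69.9, 70.4),
        "MMStar": (65.2, None, 71.5, 72.9),
        "GPQA": (65.15, 62, 50.32, 60.35),
        "AIME2024": (84.58, 76, 36.2, 72.6),
        "AIME2025": (80, 67.3, 32, 64.4),
        "MathVision": (51.41, None, 53.9, 54.5),
        "MathVista": (70.3, None, 80.7, 79.4),
        "IFEval": (81.15, 85, 71.53, 71.4),
    },
    "Science": {
        "SFE": (35.84, None, 43.2, 43.9),
        "Physics": (28.76, None, 28.3, 28.2),
        "SmolInstruct": (32.2, 17.6, 18.1, 16.11),
        "ChemBench": (76.47, 61.1, 56.2, 66.78),
        "MatBench": (61.55, 45.24, 54.3, 46.9),
        "MicroVQA": (56.62, None, 50.2, 50.96),
        "ProteinLMBench": (58.47, 59.1, 58.3, 59.8),
        "MSEarthMCQ": (58.12, None, 50.3, 47.3),
        "XLRS-Bench": (51.63, None, 49.8, 12.29),
    },
}


def count_leads(rows):
    """Return the benchmarks where the first column beats every reported rival."""
    leads = []
    for name, (ours, *others) in rows.items():
        reported = [score for score in others if score is not None]
        if all(ours > score for score in reported):
            leads.append(name)
    return leads


for category, rows in SCORES.items():
    leads = count_leads(rows)
    print(f"{category}: leads on {len(leads)} of {len(rows)}: {', '.join(leads)}")
```

Note that an N/A entry is simply skipped, so a "lead" on a vision benchmark only means ahead of the two other VLMs.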

 Serving (deployment)

The minimum hardware requirements for deploying Intern-S1 series models are as follows.

Model	A100 (GPUs)	H800 (GPUs)	H100 (GPUs)	H200 (GPUs)
internlm/Intern-S1-mini	1	1	1	1
internlm/Intern-S1-mini-FP8	-	1	1	1


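 Advanced usage

 Tool calling

Many large language models (LLMs) now support tool calling, a powerful capability that lets them extend their functionality by interacting with external tools and APIs. This makes it possible to fetch up-to-date information, execute code, call functions in other applications, and so on.

A major advantage for developers is that many open-source LLMs are designed to be compatible with the OpenAI API. In other words, the syntax and structure you are used to from the OpenAI library can be used as-is with these open-source models. The code shown in this tutorial is therefore generic and works not only with OpenAI models but with any model that follows the same interface standard.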
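As a rough illustration of what "OpenAI-compatible tool calling" means in practice, here is a minimal sketch of the client-side half: an OpenAI-style tool schema plus a dispatcher that runs a tool call and builds the `tool` message to send back. The `get_weather` tool, its stub result, and the `REGISTRY` / `run_tool_call` names are all hypothetical; I have not run this against Intern-S1-mini itself.

```python
import json

# OpenAI-style tool schema. `get_weather` is a hypothetical tool made up for this sketch.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]


def get_weather(city: str) -> dict:
    # Stub implementation; a real tool would call a weather API here.
    return {"city": city, "forecast": "sunny"}


# Maps tool names from the schema to the Python callables that implement them.
REGISTRY = {"get_weather": get_weather}


def run_tool_call(call_id: str, name: str, arguments: str) -> dict:
    """Execute one tool call and build the `tool` message for the next request.

    `arguments` is the JSON string the model produced for the call.
    """
    result = REGISTRY[name](**json.loads(arguments))
    return {
        "role": "tool",
        "tool_call_id": call_id,
        "content": json.dumps(result, ensure_ascii=False),
    }
```

With the OpenAI Python client, `TOOLS` would be passed as `tools=TOOLS` to `client.chat.completions.create(...)`, pointed at whatever OpenAI-compatible server is hosting the model. The server URL, and whether that server has tool calling enabled for this model, are assumptions to verify for your setup.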

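 Switching between thinking and non-thinking modes

Intern-S1-mini enables thinking mode by default, which strengthens its reasoning and yields higher-quality responses. It can be disabled by setting enable_thinking=False in tokenizer.apply_chat_template.
The license appears to be Apache-2.0.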
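The thinking-mode switch described in the model card could look roughly like the sketch below. It only wraps `tokenizer.apply_chat_template` and passes `enable_thinking` through as an extra template argument. The helper name `render_prompt` is mine, and I am assuming the chat template of this model reads that flag as the model card says. I have not checked whether `processor.apply_chat_template`, which the rest of this scrap uses, forwards the flag in the same way.

```python
def render_prompt(tokenizer, messages, thinking=True):
    """Render the chat template as text, optionally disabling thinking mode."""
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,  # extra keyword forwarded to the chat template
    )


# Usage (downloads the tokenizer, so it is left commented out here):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("internlm/Intern-S1-mini", trust_remote_code=True)
# print(render_prompt(tokenizer, [{"role": "user", "content": "Hello"}], thinking=False))
```

Printing the rendered text with `thinking=True` and `thinking=False` side by side is an easy way to see what the template actually changes.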



FP8 and GGUF versions are also available.
https://huggingface.co/internlm/Intern-S1-mini-FP8
https://huggingface.co/internlm/Intern-S1-mini-GGUF
On the llama.cpp side, a PR has been opened but does not appear to be merged yet, so I am not sure whether the GGUF version actually runs.
https://github.com/ggml-org/llama.cpp/pull/15412
It also seems to be available on Ollama.
https://ollama.com/internlm/interns1:mini


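Running on a Colaboratory L4.

The library requirements are roughly:

transformers 4.55.2 or later
For video
- decord is required
- installing flash_attention is recommended (to avoid OOM; two GPUs also seem to be recommended)

At the time I checked, the transformers version preinstalled on Colaboratory was fine.

!pip freeze | grep -i transformers

Output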

transformers==4.55.2
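The same check can be done from Python instead of eyeballing `pip freeze`. This is a minimal standard-library sketch. `version_tuple`, `meets_minimum`, and `check_transformers` are hypothetical helper names, and the comparison deliberately ignores pre-release suffixes, so use `packaging.version` if you need strict ordering.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Keep the first three numeric components, e.g. '4.56.0.dev0' -> (4, 56, 0)."""
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def meets_minimum(installed: str, minimum: str = "4.55.2") -> bool:
    """True if the installed version is at least the minimum (suffixes ignored)."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_transformers(minimum: str = "4.55.2") -> bool:
    """Look up the installed transformers version; False if it is not installed."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return False
    return meets_minimum(installed, minimum)
```

Calling `check_transformers()` at the top of a notebook gives an early failure instead of an obscure loading error later.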

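Video looks tough given the requirements, so I am not adding decord, but I think flash_attention is worth enabling regardless of video, so I install it.

!pip install flash-attn --no-build-isolation

Output

Successfully installed flash-attn-2.8.3

Now load the model and the processor.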

from transformers import AutoProcessor, AutoModelForCausalLM
import torch

model_name = "internlm/Intern-S1-mini"
processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)

VRAM usage at this point is about 16.5 GB.

Output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   70C    P0             33W /   72W |   16497MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
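That figure is close to what a back-of-the-envelope estimate gives. The sketch below assumes the weights were loaded in bf16 (2 bytes per parameter, which is what I would expect `torch_dtype="auto"` to pick here, but I did not verify it) and uses the nominal sizes from the model card, 8B plus 0.3B. The real parameter count differs somewhat, so treat the result as a ballpark.

```python
def weights_mib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory taken by the weights alone, in MiB (bf16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 2**20


# Nominal sizes from the model card: 8B language model + 0.3B vision encoder.
estimate = weights_mib(8.0e9 + 0.3e9)
observed = 16497  # MiB, from the nvidia-smi output above

print(
    f"weights only: ~{estimate:.0f} MiB, observed: {observed} MiB, "
    f"remainder: ~{observed - estimate:.0f} MiB"
)
```

The remainder would be things like the CUDA context and allocator overhead, which is why the observed number sits a bit above the weights-only estimate.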

First, text-only inference. I translated the sample prompt into Japanese.

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "興味深い物理現象について教えてください。"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(
    **inputs,
    max_new_tokens=32768
)
decoded_output = processor.decode(
    generate_ids[0, inputs["input_ids"].shape[1] :],
    skip_special_tokens=True
)
print(decoded_output)

The output came back in about two minutes. Everything before </think> appears to be the thinking. For some reason the opening tag is not printed.

Output

Okay, the user asked about interesting physical phenomena. Let me think of some fascinating ones. Maybe start with something visually striking, like the Casimir effect. It's counterintuitive because it involves forces between plates in a vacuum. People might find that surprising.

Then there's quantum entanglement. It's a big topic in physics and philosophy. Explaining how particles can be connected regardless of distance could be engaging. But I should keep it simple without too much jargon.

What about the Mpemba effect? It's when hot water freezes faster than cold water. That's counterintuitive and has practical implications. People might have experienced it but not understood why.

Superfluidity is another cool phenomenon. Liquid helium becoming frictionless at low temperatures. It's used in experiments and has real-world applications like cooling magnets.

The double-slit experiment shows wave-particle duality. It's fundamental to quantum mechanics and demonstrates how observation affects reality. That's a mind-bender.

Black holes are always intriguing. Their event horizons and time dilation effects, like near a black hole, could be interesting. Maybe mention Hawking radiation too.

The greenhouse effect is important for climate science. Explaining how certain gases trap heat in the atmosphere helps understand global warming.

Quantum tunneling is key in electronics, like in semiconductors. It's how electrons pass through barriers they shouldn't classically. That's crucial for technology.

The Doppler effect is relatable, like the change in pitch of a siren as it passes by. It's used in radar and astronomy to measure speeds.

The photoelectric effect led to the development of quantum theory. Einstein's explanation showed light behaves as particles, which was revolutionary.

I should structure these with brief explanations, why they're interesting, and maybe some real-world applications. Make sure to highlight the counterintuitive aspects and their significance. Avoid getting too technical but provide enough detail to spark curiosity. Check if there's any overlap or if some phenomena are more accessible than others. Maybe start with the most visually or conceptually striking ones first. Also, ensure the explanations are accurate but not overwhelming. Use examples people might encounter in daily life to make it relatable. Alright, let's put this together in a coherent way.
</think>

興味深い物理現象は数多くありますが、以下にいくつかの代表的なものを紹介します。それぞれが直感とは異なる側面や、量子力学や相対性理論などの現代物理学の核心を触り抜くものもあります。

---

### 1. **カスミール効果（Casimir Effect）**
- **現象**: 真空中に平行に置かれた2枚の金属板間に、負の圧力（引力）が発生する現象。
- **なぜ興味深い？**: 真空中にも量子振動が存在し、板間の振動モードが制限されるため、エネルギー差が生じるという反直感的な説明。
- **応用**: 微小機械（MEMS）や量子センサーの設計に利用されつつある。

---

### 2. **量子エンタングルメント（Quantum Entanglement）**
- **現象**: 量子系の粒子が相関関係を保つ状態（例：スピンの向きが逆に揺らぐ）。距離を越えて瞬時に状態が変化するように見える。
- **なぜ興味深い？**: 局所性の原理（EPRパラドックス）と矛盾し、量子力学の非局所性を示す。愛因斯坦は「遠隔作用（spooky action at a distance）」と揶揄した。
- **応用**: 量子暗号や量子コンピュータの基礎技術。

---

### 3. **マペムバ効果（Mpemba Effect）**
- **現象**: 高温の水が低温の水よりも速く凍結する現象（例：沸騰水が氷に近づくと急激に冷却される）。
- **なぜ興味深い？**: 直感とは逆の結果で、原因は未解明（溶解ガスの放出、対流、過冷却など複数の要因が影響する）。
- **応用**: 冷凍技術や材料科学への応用研究が進められている。

---

### 4. **超流体性（Superfluidity）**
- **現象**: 液体ヘリウム-4が臨界温度以下で粘性が0（摩擦なし）になる現象。
- **なぜ興味深い？**: ボーゾン-アインシュタイン凝縮（BEC）と関連し、量子力学の宏偉な例。
- **応用**: 超伝導体の研究や精密計測装置（例：原子時計）の冷却源として利用。

---

### 5. **ダブルスリット実験（Double-Slit Experiment）**
- **現象**: 光や電子がスリットを通過する際に干渉縞を形成し、波粒二象性を示す。
- **なぜ興味深い？**: 觳覧観測が結果に影響し、観測行為自体が現象を決定するという量子力学の特徴。
- **哲学的意味**: 「観測者効果」や「現実の性質」についての議論の起点。

---

### 6. **ブラックホールの時間膨張（Gravitational Time Dilation）**
- **現象**: 黒洞の重力場で時間が遅れる現象（例：GPS衛星の時計修正）。
- **なぜ興味深い？**: 相対性理論の予言であり、重力が時空の構造に影響する例。
- **実例**: 黒洞周辺の星の運動観測（例：銀河中心のSgr A*）で観測されている。

---

### 7. **温室効果（Greenhouse Effect）**
- **現象**: 地球大気の温室効果ガス（CO₂、メタンなど）が太陽光を吸収し、地球を暖める現象。
- **なぜ興味深い？**: 自然現象として必要だが、人為的CO₂放出が加速し、気候変動の原因となった。
- **社会的影響**: 全球温暖化問題の科学的根拠。

---

### 8. **量子トンネル効果（Quantum Tunneling）**
- **現象**: 量子力学的に、粒子がエネルギー障壁を越える現象（例：半導体の素子動作）。
- **なぜ興味深い？**: 経典力学では不可能な現象で、量子力学の確率性を示す。
- **応用**: ナノテクノロジー、核融合反応の理解に不可欠。

---

### 9. **ドップラー効果（Doppler Effect）**
- **現象**: 波源と観測者の相対速度が波長/周波数に影響し、音や光の色が変化する現象（例：警笛の音高変化）。
- **なぜ興味深い？**: 雷電の光の赤シフト、宇宙の膨張（ハッブル法則）の観測の基盤。
- **応用**: 雷達、医学画像診断（超音波）、天体物理学。

---

### 10. **フォトエミッション効果（Photoelectric Effect）**
- **現象**: 光が金属表面に当たると電子が放出される現象（光の粒子性の証拠）。
- **なぜ興味深い？**: エインシュタインの光量子仮説で、光の波動説と矛盾し、量子力学の誕生を促した。
- **応用**: 光電管、太陽電池、光センサー。

---

### 最終的なメッセージ
これらの現象は、物理の基礎を深く掘り下げ、私たちの直感を超えた世界のしくみを示しています。特に量子力学や相対性理論は、現実の構造を再考させ、技術革新の源泉となっています。興味深い現象は、単に「驚き」だけでなく、科学の限界を越える可能性を示唆しています。
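Since everything before `</think>` in the output above is the reasoning trace, the final answer can be separated out with a small helper. This is a minimal sketch; `split_thinking` is a hypothetical name, and it assumes the closing tag appears at most once in the decoded text.

```python
def split_thinking(decoded: str, close_tag: str = "</think>"):
    """Split decoded text into (thinking, answer) at the first closing tag."""
    thinking, sep, answer = decoded.partition(close_tag)
    if not sep:
        # No closing tag found: treat the whole text as the answer.
        return "", decoded.strip()
    # The opening tag is not printed in the output above, but strip it if present.
    thinking = thinking.strip().removeprefix("<think>").strip()
    return thinking, answer.strip()


# Usage with the decoded text from the generation step:
# thinking, answer = split_thinking(decoded_output)
# print(answer)
```

One caveat: if generation stops at `max_new_tokens` while still thinking, there is no closing tag and this helper returns the unfinished reasoning as the answer, so checking for an empty `thinking` is a useful guard.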

VRAM usage after inference rose only slightly.

Output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   75C    P0             68W /   72W |   16923MiB /  23034MiB |     97%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Recommended parameter values are provided.

Sampling parameters

For better results, the following hyperparameters are recommended.
top_p = 1.0
top_k = 50
min_p = 0.0
temperature = 0.8

They can be set like this.

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "興味深い物理現象について教えてください。"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(
    **inputs,
    max_new_tokens=32768,
    # add the recommended sampling parameters
    do_sample=True,
    top_p=1.0,
    top_k=50,
    min_p=0.0,
    temperature=0.8,
)
decoded_output = processor.decode(
    generate_ids[0, inputs["input_ids"].shape[1] :],
    skip_special_tokens=True
)
print(decoded_output)

Output

Okay, the user asked about interesting physical phenomena. Let me think of some cool ones. Maybe start with something visual, like a vortex. The coffee cup vortex effect where you stir and then the liquid spirals down. That's pretty cool because it's something you see every day but might not realize the physics behind it. Then there's the Tacoma Narrows Bridge collapse, which is a classic example of resonance and aerodynamics. People often know the story but might not get the physics details. Oh, and the Möbius strip, which is a mathematical shape but has interesting physical implications. Maybe the Casimir effect? It's quantum physics, showing forces between plates in a vacuum. That's pretty mind-blowing. Also, quantum entanglement is a big one, though it's more quantum mechanics. The user might appreciate something relatable like the Doppler effect, which you hear in cars speeding by. Or maybe the Leidenfrost effect, where a liquid can levitate on its vapor. That's visually striking. And don't forget about the shock waves from supersonic objects, like the sonic boom. Oh, and the Mpemba effect where hot water freezes faster than cold, which is still debated. Also, the quantum Zeno effect where frequent observation freezes a quantum state. Maybe the Bose-Einstein condensate, where particles behave as a single entity at low temps. The user might like the simplicity of the vortex explanation first, then build up to more complex topics. Need to check if these examples are accurate. Vortex is definitely due to angular momentum conservation. Resonance in the bridge collapse is correct. Casimir effect is real but maybe a bit advanced. Leidenfrost effect is good for visuals. Doppler effect is straightforward. Maybe start with the vortex and then go into more complex ones like quantum phenomena. Also, ensure that each example explains the phenomenon clearly and why it's interesting. Avoid jargon where possible. 
Make sure to mention the relevance to everyday life or unexpected aspects. Like how the vortex starts spinning when you stop stirring because of angular momentum transfer. Or how the Mpemba effect challenges intuition. That should cover a range of topics from classical to quantum, visual to theoretical. Need to present them in a way that's engaging and informative without overwhelming the user.
</think>

興味深い物理現象は実に種類が多く、日常の小さな事象から量子力学の奇妙な世界まで幅広くあります。以下に代表的な例をいくつか紹介します：

---

### 1. **コップの渦（液体の回転現象）**
   - **現象**: コップに水を回転しながら入れると、水面が渦をつくります。攪拌をやめても、液体は徐々に中心に向かって回転を続けます。
   - **物理の原理**: 角運動量の守法則（物理学の基本法則の1つ）。液体が回転しているとき、その運動量（角運動量）は保存されるため、攪拌をやめても液体全体が一緒に回転を続けます。この現象は宇宙の巨大な渦や惑星の自転軸のずれ（北極星の変化）にも繋がります。

---

### 2. **タcoma海峡橋の崩壊（振動・共振現象）**
   - **現象**: 1940年に日本橋が風に揺らめいて崩壊した事故。風の振動周波数が橋の固有振動数と一致し、共振現象が発生したと考えられます。
   - **物理の鍵**: 振動・波の知識。共振は音楽の音色や建築物の耐震性設計でも重要な概念です。これにより、風洞実験や振動解析が重要になりました。

---

### 3. **モービアスストリップ（数学と物理学の境界）**
   - **現象**: 1つの端口で始まるストリップを180度回転して接着すると、片側面だけを持つ形状になります。これにより、紙を曲げると裏表が入れ替わります。
   - **物理的意義**: 非ユークリッド幾何や拓撲学の応用例。物理学では、量子力学や弦理論で高次元空間の説明に使われます。

---

### 4. **カスミール効果（真空の力）**
   - **現象**: 真空中に平行に配置した2枚の金属板がわずかに引力で引き寄せられる現象です。
   - **物理の矛盾**: 計算上0でない力が現れる「真空のエネルギー」と関係しています。これは量子場の理論の例で、未解の問題（真空エネルギーの実験値と理論値の巨大な差）にも繋がります。

---

### 5. **レーデンフォース効果（液体の反重力現象）**
   - **現象**: 微软雅黑な液体膜が蒸気層で固体表面から離れて浮遊する現象。例：熱い水を鍋に溜めると液体が氷のように浮きます。
   - **応用技術**: ハイスピードカメラやマイクロ流体デバイスの冷却剤として利用されています。

---

### 6. **ドップラー効果（音と光の歪み）**
   - **現象**: 移動する物体から発する波の周波数が観測者の運動に依存して変化します（例：急速に近づく警笛が高音、遠ざかると低音になります）。
   - **実用例**: 自動車の速度計や宇宙の星の移動測定（赤シフト・青シフト）に使われています。

---

### 7. **ボース・インスターナンデー凝縮体**
   - **現象**: 非常に低温（約1ナノケルビン）に冷却された原子がボース粒子として単一の量子状態に集団化する現象。
   - **量子力学の証拠**: 微視的な粒子が巨視的な波の性質（干渉現象）を示す例です。これにより、超流動性や超伝導の理解が進みます。

---

### 8. **マペムバ効果（水の凍結の謎）**
   - **現象**: 高温の水が低温の水よりも早く凍る現象（例：沸騰水が氷になるのに普通の水より早くなる現象）。
   - **未解の問題**: 液体の状態や空気の沸点依存性、溶解物の影響が関連するが、完全に解明されていません。

---

### 9. **ショックウェーブ（超音速の音爆）**
   - **現象**: 超音速で飛行する物体が空気分子を圧縮して発生する衝撃波（音爆）。音爆の衝撃波は地面で広がり、不快な音を生みます。
   - **応用例**: 高速カメラや爆発物の爆発波の解析に使われています。

---

### 10. **量子ゼノエフェクト（観測が物質の状態を固定する）**
   - **現象**: 高頻度で観測される量子系が、状態が崩壊しようとする瞬間を「凍み」こみ、状態を維持する現象です。
   - **哲学的意義**: 「観測が現実を決定する」という量子力学の観念に深く関わっています。

---

### 特徴的な共通点:
- **直感的反応**: 日常の現象を眺めても物理の法則が驚くほど影響しています。
- **計量学の限界**: レーザー測定や計測技術が未だ解明できない謎を残しています。
- **実用と基礎の橋渡し**: ダイナモ電機の開発（渦現象）、レーザー冷却（レーデンフォース効果の応用）など、基礎研究から技術革新まで繋がっています。

興味深い物理現象は、単に「なぜ？」という疑問に答えるだけではなく、宇宙の構造や自然の美しさを理解する鍵ともなっています。どのようなテーマに興味があるら、さらに詳しく解説できます！


Next, image input. The sample uses the following image.

from IPython.display import Image
import requests

response = requests.get("http://images.cocodataset.org/val2017/000000039769.jpg")
Image(response.content)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "この画像の内容を具体的に説明してください。"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(
    **inputs,
    max_new_tokens=32768,
    # add the recommended sampling parameters
    do_sample=True,
    top_p=1.0,
    top_k=50,
    min_p=0.0,
    temperature=0.8,
)
decoded_output = processor.decode(
    generate_ids[0, inputs["input_ids"].shape[1] :],
    skip_special_tokens=True
)
print(decoded_output)

Output

好的，我现在需要仔细分析用户提供的这张图片，并生成一个详细的描述。首先，我需要确认用户的需求是什么。用户可能希望得到关于图片中内容的准确、全面的描述，可能用于写作、绘画参考、社交媒体分享或其他用途。因此，我需要确保描述尽可能详尽，同时保持条理清晰。

首先，我会观察图片的整体场景。图片显示两只猫躺在一张粉红色的沙发上，周围有遥控器。接下来，我会分区域描述各个元素，确保不遗漏任何细节。例如，沙发的颜色、材质，两只猫的外观特征，包括毛发颜色、花纹、佩戴的项圈，它们的姿态和表情，遥控器的位置和数量，以及背景的其他细节。

接着，我需要考虑图片的角度。看起来像是从上方俯视拍摄的，这可能影响到某些元素的可见性，比如猫的背部和遥控器的位置。因此，在描述时需要指出这一点，以帮助读者更好地想象画面。

然后，我会逐一检查每个细节。例如，左边的猫是小猫，蜷缩着，右边的猫是较大的成猫，侧卧。它们的毛色都是棕黑相间的虎斑纹，但花纹可能略有不同，需要注意描述清楚。项圈的颜色和材质也需要提及，比如左边的猫是白色项圈，右边的猫是黑色项圈。

遥控器的数量和摆放位置也需要具体说明，比如左边有一个白色遥控器，右边有一个橙色或粉红色带白色按键的遥控器，可能还有一部分被挡住的。此外，沙发的材质看起来是绒面革或布艺，颜色是鲜艳的粉红色，这增加了画面的温馨感。

背景中是否有其他元素？比如沙发后面的墙壁颜色，是否有其他家具或装饰物？根据图片，可能没有明显的背景细节，但如果有，也需要提及。不过根据提供的图片，可能背景比较简单，所以可以省略。

接下来，我会考虑描述的结构。通常，可以从整体到局部，从左到右或从上到下，按照逻辑顺序描述。例如，先描述沙发，再描述两只猫的位置和姿态，然后描述遥控器的位置，最后补充细节如项圈颜色、毛发纹理等。

需要注意避免主观判断，比如“看起来很舒服”这样的描述，而是用客观的词汇，如“悠闲地躺着”、“蜷缩着身体”。同时，要确保描述准确，比如猫的品种是否可以确定？根据花纹，可能是家猫中的虎斑猫，但不确定具体品种，所以可以描述为“棕黑色虎斑猫”。

另外，需要确认是否有其他值得注意的细节，比如猫的胡须、爪子的颜色、是否有阴影效果等。例如，左边的猫的爪子是黑色的，右边的猫的尾巴上有白色斑纹。这些细节能增加描述的丰富性。

最后，检查是否有遗漏的部分，比如是否有其他遥控器或物品，或者猫的头部朝向等。例如，两只猫的头部朝向不同，左边的猫头朝向左侧，右边的猫头朝向右侧，这可能是由于它们各自的位置所致。

总结来说，我需要将整个图片分解为各个组成部分，逐一详细描述，确保用户能够通过文字清晰地重构画面。同时，保持语言流畅，避免重复，并且结构清晰，便于读者阅读和理解。
</think>

この画像は、パティ廳の上から俯瞰した状態で撮影された懐かしい場面を捉えています。画面全体を占めるのは、濃いピンクのソファーで、その上に2匹の虎斑猫がリラックスした姿勢で寝伏せています。以下に具体的な要素を分解して記載します。

---

### **1. ソファーの特徴**
- **色と素材**: 濃い輝きのあるピンク色で、素材は疑似皮革（レザー風）である可能性が高め。表面に穏やかな光沢が見られ、触感は柔らかい印象です。
- **形状と構造**: 立体的なポッカリとした座布団の形状が印象的で、猫が寝ている場所は中央部分に位置しているようです。高さや幅の目安は不明ですが、猫の体長と比較すると比較的小型のソファーである可能性があります。

---

### **2. 左側の幼猫**
- **姿勢**: 中央に位置し、体を丸めながらリラックスしたポーズを取っています。前足は曲げて頭下ろしに、後ろの足は座布団の縁に張り付けており、まるで「寝ている間も緊張感を忘れない」といった雰囲気です。
- **外見**: 
  - **毛色**: 主に褐色と黒色が細いストライプで混じった虎斑紋で、体幹には濃く染まっている部分と、首や腹部に白い色が混じり、斑紋の濃淡が繊細に表現されています。
  - **項圈**: 白色の項圈を着用しており、細い金属製の部分が見えます。おそらくカバーのない簡易なタイプで、幼猫に適したデザインかと推測されます。
  - **爪と尾**: 爪は黒色で、少し外側に曲がっている。尾は白と黒の調和感のある斑紋があり、毛量が多く優美的な曲線を帯びています。

---

### **3. 右側の成猫**
- **姿勢**: 左側の幼猫の横に座布団の上を広く広げて寝伏せており、体の大部分が広がっています。前足は座布団の縁に張り付けられ、後ろの足は柔らかく曲げられています。頭は目を閉じて、極めてリラックスした表情を浮かべています。
- **外見**: 
  - **毛色**: 虎斑紋がより豊かで、濃い褐色と黒色のストライプが織り交ぜてあります。体幹や首の部分に濃い色が集まり、腹部や首には白い部分が広範囲に広がっています。
  - **項圈**: 黒色の項圈を着用しており、細いデザインで、幼猫の項圈よりも簡潔な印象です。
  - **表情と細部**: 
    - 鼻先は黒色で、髭毛は白く色濃く伸びています。
    - 右耳は曲がり耳で、黒色の小さな点が耳内に輝いています。
    - 眼瞼は完全に閉じており、目元の毛が優しい弧を描いています。

---

### **4. 遥控器の配置**
- **左側の遙控器**: 幼猫の脇の下に位置し、白色で、一部が座布団の縁に重なっています。キーの配置は不明ですが、色の濃淡が見られ、端末の形状が一般的なテレビ遙控器と一致しています。
- **右側の遙控器**: 成猫の体の近くに置かれ、橙色と白色の組み合わせで、形状はカスタマイズされた可能性が高い（例：ゲームコントローラーのようなデザイン）。一部が隠れていますが、赤橙色の側面が目立ちます。両器は座布団に平行して並び、猫の間に置かれた様子です。

---

### **5. 全体の雰囲気と空間感**
- **配色**: ピンクの座布団と虎斑紋の猫の自然な黒・褐色がコントラストを生み、視覚的に温か味のあるバランスです。白や橙色の遙控器が色を華やかに演出しています。
- **空間の要因**: 背景には他の家具やインテリアが見られず、視点が集中している分、ソファーと猫が主役となっています。光は自然光か暖色系の室内光で、猫の毛の質感が柔らかく見えます。

---

この一瞬が、猫達の平和なるリラクゼーションタイムを凝縮した静かなイラストのような存在です。幼猫と成猫の対比が魅力的で、日常の快適さを具現化させています。


As I usually do with VLMs, I check the output on my own images. The prompt and parameters stay the same; only the image URL changes.
image_url = "......"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {"type": "text", "text": "この画像について詳しく説明して。"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(
    **inputs,
    max_new_tokens=32768,
    # add the recommended sampling parameters
    do_sample=True,
    top_p=1.0,
    top_k=50,
    min_p=0.0,
    temperature=0.8,
)
decoded_output = processor.decode(
    generate_ids[0, inputs["input_ids"].shape[1] :],
    skip_special_tokens=True
)
print(decoded_output)
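Because only the image URL changes between runs, the steps above can be folded into one function. This is a sketch of how I might wrap it, not something from the model card: `build_messages`, `describe_image`, and `SAMPLING_KWARGS` are hypothetical names, and `model.dtype` stands in for the hard-coded `torch.bfloat16` on the assumption that the model was loaded in bf16.

```python
# Recommended sampling parameters from the model card, reused for every call.
SAMPLING_KWARGS = dict(do_sample=True, top_p=1.0, top_k=50, min_p=0.0, temperature=0.8)


def build_messages(image_url: str, prompt: str) -> list:
    """Build a single-turn message with one image and one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def describe_image(model, processor, image_url,
                   prompt="この画像について詳しく説明して。", max_new_tokens=32768):
    """Run one image + prompt through the model and return only the newly generated text."""
    inputs = processor.apply_chat_template(
        build_messages(image_url, prompt),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device, dtype=model.dtype)
    generate_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, **SAMPLING_KWARGS)
    # Drop the prompt tokens so that only the generated continuation is decoded.
    new_tokens = generate_ids[0, inputs["input_ids"].shape[1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)
```

With this in place, each test below would be a single call such as `print(describe_image(model, processor, image_url))`.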
Kobe scenery

Inference result
Okay, let's see. The user provided an image and wants a detailed explanation. First, I need to identify the elements in the photo. There's a body of water in the foreground, probably a bay or harbor. Along the coastline, there are several modern buildings and structures. The most prominent one is a tall red tower with a lattice structure, which looks like the Tokyo Tower. There's also a distinctive white building with a curved, tent-like roof, which might be the Tokyo Big Sight. A white boat or ferry is visible on the water, docked near the buildings.
I should mention the location, which is likely Tokyo, Japan, given the landmarks. The red tower is definitely Tokyo Tower, and the Big Sight is nearby. The user might be interested in knowing about these landmarks, their significance, and the surrounding area. Maybe they're planning a trip or doing research. They might also want to know about the architecture, the history of the buildings, or recreational activities in the area. I should structure the explanation starting from the foreground (water), then the buildings, identify the landmarks, and provide some context about their purpose and history. Also, check if there's any other notable details like the mountains in the background, which could be Mount Fuji, but since it's not visible here, maybe other hills around Tokyo. Need to make sure the information is accurate and relevant to the image provided.

</think>
この画像は、日本の東京にある隅田川や隅田川クルーズの様な景観を捉えたものです。以下に詳細を説明します：

 主なランドマークと建築物
赤レンガ塔（東京タワー）
画面中央に位置する赤い鉄塔は、東京タワーです。
明治時代に建設された電信塔ですが、戦後再建され、观光スポットとして有名です。
高さは634メートルで、東京スカイツリーよりも低いが、日本の地標として親しまれています。
丸の内エリアのビル群
左側の白色なモダン建築群は、丸の内地区のビジネスビル群です。
中央には「スギウラホテル」と書かれた建物があり、東京のビジネスエリアの一部です。
白いカモメ巣状建築（東京ビッグサイト）
東京タワーの右側に見える白いカモメ巣状の建築は、東京ビッグサイトです。
国際的なイベントや展示会の会場として活用されており、カモメ巣のようなデザインが特徴です。

 背景の自然景観
背景には緑の丘々が見え、東京湾周边の山々（たとえば隅田川沿岸の荒川や隅田川の支流）を表しています。
天空は晴れ気味で、遠方の景色が視界に余裕があります。

 水面上の要素

クルーズボート

水面上に浮かぶ白いボートは、隅田川クルーズや東京湾クルーズ用のボートです。
隅田川クルーズは、東京タワーや晴空塔、東京スカイツリーを一望できる名所です。
ボートの運行は観光目的で、夜間は華やかなライトショーを楽しむこともできます。

 場所と周辺の特徴

隅田川

東京の幹線河川で、江戸時代から交通の要衝でした。現在も文化や歷史が残る街並みを環繞しています。

丸の内地区

東京のビジネス中心地の1つで、東京駅や丸の内通りが起点です。

東京ビッグサイト

東京国際展示場として、世界のイベントや国際的な展示会が開催されています。

 歷史的・文化的背景

東京タワーの由来

原は1872年に電信用の白塔でしたが、大戦後再建し、現在の赤レンガ塔の姿に至りました。

隅田川の文化

日本国宝の隅田川関宿伝統舞や、文化財の隅田川関宿の船や、伝統的な町家が並び、歴史と現代が融合した場所です。
この画像は、東京の都市景観と自然との調和を表しており、観光やビジネス、文化活動の中心地を一望できます。東京タワーの赤い姿は世界中から認識される東京の象徴です。
Image of a book

Inference result
Okay, let's take a look at this image. The user provided a picture of what seems to be a Japanese document or pamphlet related to "コメントのコミュニーションの全体制観" which translates to "An Overview of Communication Systems in Comments." The document includes some structured diagrams and text in Japanese. The main elements are boxes with text, arrows connecting them, and color-coded sections with green and white colors.
First, I need to parse the text. The title at the top mentions "コメントのコミュニケーションの全体制観" and "上記原則の手順 中川大介氏." Translating that, it's "An Overview of Communication Systems in Comments" followed by "The procedure of the above principles by Daisuke Nakagawa." So this is likely an academic or educational material from Daisuke Nakagawa, discussing principles related to comment communication systems.
Looking at the diagram, there's a flowchart with three connected boxes labeled "投稿者" (Poster), "コメント者" (Commenter), and "管理者" (Administrator). Each box has smaller text underneath. "投稿者" has "投稿内容の品質向上," "コメント者" has "自己表現とコミュニケーションの促進," and "管理者" has "不適切コメントのモニタリング." So the flowchart illustrates how the interaction between posters, commenters, and administrators contributes to improving content quality, promoting self-expression and communication, and monitoring inappropriate comments.
The green section at the bottom has bold text "解・動・早で進めよ" which translates to "Proceed with Understanding, Action, and Speed." The smaller text underneath says "解っていたのですが動いていきたいな気持ちで作られました" which means "It was made with a feeling of wanting to move forward even though I understood it." This suggests that the document is intended to guide users towards implementing these principles effectively by first understanding them, taking action, and doing so quickly.
The structure of the document seems to outline three interconnected roles in a comment system: the person posting content, those commenting on it, and the administrators managing the platform. Each role plays a part in maintaining a healthy community. The flowchart shows the interaction between these roles, emphasizing the importance of each in sustaining effective communication.
The principles mentioned (like "上記原則の手順") likely refer to specific guidelines or steps that Nakagawa outlines for optimizing comment systems. The emphasis on understanding, acting, and speeding up might be a call to action for users and administrators to actively engage with these principles to improve their comment systems.
I should check if there are any other elements in the image that I might have missed. The boxes are connected with arrows, indicating a flow or process. The green color in the bottom section might signify action or urgency, aligning with the "解・動・早" message. The document appears to be educational, possibly part of a presentation or a manual for managing online comment sections effectively.
Possible applications of this framework could be in social media platforms, forums, or any online community where user-generated comments are prevalent. The goal is to create a balanced environment where content quality improves, self-expression thrives, and moderation prevents harmful content. Each role's responsibilities are clear: posters provide content, commenters engage constructively, and administrators ensure the system runs smoothly.
I should also consider the target audience. Since it's authored by Daisuke Nakagawa, maybe an expert in communication systems or online platforms, the intended audience could be platform administrators, community managers, or educators teaching about digital communication. The document serves as a guide to implement best practices in managing comment systems effectively.
In summary, the image presents a structured overview of how different stakeholders (posters, commenters, administrators) interact within a comment system to achieve effective communication, guided by principles that encourage understanding, proactive action, and timely responses.

</think>
この画像は、コメント機能のコミュニティ形成に関する総観的手法を示す文書であり、 「コメントのコミュニケーションの全体制観」 というタイトルの下に具体的な手順やプロセスが示されています。以下に詳細を説明します。

 1. 主な構成要素
タイトルと著者
「コメントのコミュニケーションの全体制観」
コメント機能を通じたコミュニケーションシステムの全体像を解説する内容

「上記原則の手順 中川大介氏」
中川大介氏（仮名）によって、上記の原則を実践するための手順が示されたと推測されます。

フローチャート

3つの役割の連携

投稿者（投稿内容の品質向上）
コメント機能を通じた自己表現や情報共有を促し、投稿内容の価値を高める役割。


コメント者（自己表現とコミュニケーションの促進）
コメントを通じて意思疎通やフィードバックを提供し、コミュニティの活性化を図る役割。


管理者（不適切コメントのモニタリング）
機会適応や規制の観点から、不適切なコメントを監視・管理し、コミュニティの健全性を維持する役割。



相互作用の流れ

投稿者→コメント者→管理者の閉じたシステムとして、各ステークホルダーの連携が示されています。
下部のメッセージ

「解・動・早で進めよ」

理解（解）→ 行動（動）→ 整備（早）

- コメント機能を効果的に活用するためには、まず原理を理解し、次に即時的な実装を、最後に整備・改善を重ねていく必要性を強調。


「解っていたのですが動いていきたいな気持ちで作られました」
自分自身が理解している内容を、行動に移す意欲を共有した形で文書が作成されたと伝えています。


 2. コメントシステムの目的と課題
目的
コメント機能を通じて、投稿者と読者の間で 相互協同関係（共生系） を形成し、活性化・信頼性を高める。
コミュニティの成長を支援するためのシステム設計の指針を提供。
課題
不適切コメント（スパム、誹謗、イデオロギーの干渉）
投稿内容の質の低下や過剰なノイズ（冗長なコメント）
管理者の負担増と効率化の必要性

 3. 中川氏の提案
「解・動・早」の実践例

理解（解）
コメントシステムの規則や目的を明確に伝える（投稿者の意識向上）。


行動（動）
即時的な不適切コメントの削除や、投稿者のフィードバックを迅速に反映させる。


整備（早）
データ分析を通じて、システムの改善点を定期的に追加（AIによる不適切コメント検出の導入など）。

共生系の維持
投稿者とコメント者の双方向の満足度向上を目指し、管理者が適切な制衡（ルールの明確化）を実施する必要性を重視。

 4. 対象用途
SNS、フォーラム、オンライン掲示板など、コメント機能が組み込まれたプラットフォームの管理者向けガイド
コンサルティングや研修資料としての利用も想定
コミュニティ形成の学術的研究や政策立案への参考資料

 5. 観点からみる価値

理論的側面
社会技術学やシステム思考の視点から、人間と技術の相互作用を考察。


実践的側面
即時性と効率性を重視した、実用的なシステム設計の手順を提供。


倫理的側面
自己表現の促進と規制の平衡点（自由と秩序）の重要性を強調。

この文書は、オンラインコミュニティの活性化を目指すためのコンセプトを、理論と実践の両面から解説しています。特に、管理者の役割とその実施手順に焦点を当て、効果的なコメントシステムの構築方針を示唆しています。
Image of a fictitious invoice

Inference result
Alright, let me try to figure out how to explain this image in detail. First, I need to understand what the image shows. It looks like a Japanese invoice or purchase request document. Let me break down each part step by step.
Starting with the top section: There's a title "御請求書" which translates to "Purchase Request" or "Invoice Request." Then there's a reference number "INV-2024-0820." Below that, there's a company name "模範商事株式会社之印" which would be "Model Business Co., Ltd." with "Seiin" as an additional part, possibly indicating a branch or another designation. The address is in Tokyo's Koto Ward, a phone and fax number are provided too.
Next, there's the recipient's section: "範例工業株式会社 御中" which means "To Example Industrial Co., Ltd." So this is a request from Model Business Co., Ltd. to Example Industrial Co., Ltd.
Then the table starts listing items. The columns are "項目" (item), "数量" (quantity), "単価" (unit price), and "金額" (amount). The items listed are special A4 paper costing ¥50 each, totaling 1000 units for ¥50,000; high-grade ink priced at ¥2,000 each, 20 units totalting ¥40,000; and "筆セット (各種)" which I think translates to "Pen Set (Various)" at ¥1,000 each, 50 units for ¥50,000. The subtotal is ¥140,000.
Then there's consumption tax (消費税) at 10%, which is ¥14,000, making the total ¥154,000. The notes at the bottom (備考) mention payment due within 30 days from invoice date, transportation costs borne by the requester, and contact information for any questions. The responsible person is from the Business Department, Mr. Taishi Kitanaga.
I need to make sure I'm translating all parts correctly. For example, "御中" is a polite term indicating "to you, [company name]." The tax calculation: 10% of 140,000 is 14,000, yes. The total is correct. The notes clarify payment terms, transportation responsibility, and contact for inquiries.
I should also note the format: tables, official stamps, and the layout typical of Japanese business documents. Maybe mention the official stamp "之印" (Seiin) which is common in Japanese business letters. The structure is formal, with clear sections for each part of the invoice.
I should check if there's anything else important. The company addresses are in Japan, both Tokyo and Osaka. The document is dated August 20, 2024. Payment terms are 30 days, which is standard. Transportation costs are handled by the requester, so the supplier doesn't cover that. Contact info includes both companies' addresses and contact details.
Overall, the document is a formal request for purchase, detailing items, prices, taxes, and payment instructions. It's a standard business communication in Japan, using proper terminology and layout.

</think>
この画像は、日本の商社間での請求書請求書（注文書/取引書）樣式に基づいた文書です。以下詳細を分解します。

 1. 文書の構成要素

 上部情報
タイトル

御請求書（請求書/注文書の請求書）

請求書番号: INV-2024-0820（2024年8月20日の請求）
発行元情報
公司名: 模範商事株式会社之印（「之印」は商号の一部、または印鑑の名称）
住所: 東京都千代田区見本町1-1
電話/フロント: 03-1234-5678
ファックス: 03-1234-5679
宛先情報
受取元: 範例工業株式会社御中（「御中」は「貴社宛て」の敬語表現）

 2. 請求内容の詳細

 商品・サービス一覧


項目	数量	単価	金額
特選和紙（A4サイズ）	1000	¥50	¥50,000
高級墨（松煙）	20	¥2,000	¥40,000
筆セット（各種）	50	¥1,000	¥50,000


小計

合計金額: ¥140,000（税抜）

 消費税（消費税10%）
¥140,000 × 10% = ¥14,000

合計金額（税込）

¥140,000 + ¥14,000 = ¥154,000

 3. 備考・条件

支払い条件

請求書発行日より30日以内にご振込お願い。

運賃負担

貨物の運送費は発行元（模範商事株式会社）が負担。

問い合わせ先

貨物に関するお問い合わせは受取元（範例工業株式会社）に連絡ください。

 4. 発行責任者
担当部署: 販売部
担当事務: 見本太郎

 5. 文書の形式

官式スタイル
表格・項目別記載・税金欄の明確化
公式印（之印）の配置
言語: 完全日本語（ビジネス敬語使用）


印刷例の特徴
数量・価格の見易しさ
税金計算欄の区切（税込/税込）


 6. 各種注意点

金額の計算方法
特選和紙: 1000枚 × ¥50 = ¥50,000
高級墨: 20本 × ¥2,000 = ¥40,000
筆セット: 50セット × ¥1,000 = ¥50,000
税込金額: ¥140,000 × 1.10 = ¥154,000


支払い期限の意味
商業信用の基準（通常30日以内）


 7. 文書の用途
請求書請求書（注文書/取引書）として、受取元に商品の注文を申し込む際使用
貯金・財務管理のための明瞭な金額・条件記載
商社間の業務連絡の形式としての標準化

 8. 画像全体の特徴

日本企業の文書スタイル
正書・形式的表現
公式印（之印）の使用
明確な責任分担と条件記載


ビジネスコミュニケーションの文脈
対象は商社間の取引（模範商事株式会社←→範例工業株式会社）
商品の品質や運賃負担が明示

この文書は、日本の商社間取引における標準的な請求書の形を示し、商品の詳細、金額計算、条件・条件の明確な記載が特徴です。
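Hmm, how to put it... when it comes to Japan and the Japanese language, it seems to half understand and half not, to half read the text and half not. From the first two results I wondered whether it was weak on Japan, but it read the last one fairly accurately. I can't quite tell what is going on. The results may also be affected by things like image quality.
Added 2025/08/22
I went back over my article on trying InternVL3, and it does seem possible that this model family has been weak on Japan-specific knowledge from the start.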
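
As a quick cross-check of the last result, the figures in the model's answer can be recomputed. This is a minimal sketch that only verifies the transcribed numbers are internally consistent; it cannot tell whether they match the original image. `invoice_totals` is a hypothetical helper introduced here for illustration.

```python
# Line items as transcribed in the model's answer: (name, quantity, unit price in yen).
ITEMS = [
    ("特選和紙（A4サイズ）", 1000, 50),
    ("高級墨（松煙）", 20, 2000),
    ("筆セット（各種）", 50, 1000),
]
TAX_RATE_PERCENT = 10  # consumption tax rate stated in the answer


def invoice_totals(items, tax_rate_percent):
    """Return (subtotal, tax, total) in yen using integer arithmetic."""
    subtotal = sum(qty * unit_price for _, qty, unit_price in items)
    tax = subtotal * tax_rate_percent // 100
    return subtotal, tax, subtotal + tax


subtotal, tax, total = invoice_totals(ITEMS, TAX_RATE_PERCENT)
print(subtotal, tax, total)  # 140000 14000 154000
```

The recomputed subtotal, tax, and total agree with the ¥140,000, ¥14,000, and ¥154,000 in the answer, so at least the arithmetic part of the reading is self-consistent.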


kun432

Video looked tough given the hardware requirements, but I tried it on an L4 anyway and promptly ran into CUDA out of memory.

It probably needs at least an A100, but none happened to be available, so I skipped it.

kun432

I won't try them this time, but the model also supports the following.

Function calling
Enabling or disabling thinking mode

kun432

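By the way, looking at the llama.cpp PRs, a PR has been opened but it does not appear to be merged yet. Will the GGUF version actually run?

The PR still isn't merged, but I'll try it with llama.cpp anyway. The environment is Ubuntu 22.04 + RTX 4090.

Build llama.cpp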

git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 16

Download the model. The only variants provided are Q8_0 and f16, so this time I downloaded the entire model repository.

git clone https://huggingface.co/internlm/Intern-S1-mini-GGUF

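The model card has a CLI sample, but I couldn't quite figure out how to apply a system prompt with llama.cpp's multimodal CLI (it seems to differ somewhat from the regular CLI), so I'll try server mode instead. I used the f16 variant.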

./build/bin/llama-server \
    --model Intern-S1-mini-GGUF/f16/Intern-S1-mini-f16.gguf \
    --mmproj Intern-S1-mini-GGUF/f16/mmproj-Intern-S1-mini-f16.gguf \
    --gpu-layers 100 \
    --temp 0.8 \
    --top-p 0.8 \
    --top-k 50 \
    --host 0.0.0.0 \
    --port 8080 \
    --seed 1024

Open the GUI in a browser and, in the settings, set the system prompt from the model card.

You are an expert reasoner with extensive experience in all areas. You approach problems through systematic thinking and rigorous reasoning. Your response should reflect deep understanding and precise logical thinking, making your solution path and reasoning clear to others. Please put your thinking process within <think>...</think> tags.

With that, inference works.
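
The same request can also be sent without the GUI. The following is a minimal sketch, assuming the server started above is reachable at `http://localhost:8080` and exposes llama.cpp's OpenAI-compatible `/v1/chat/completions` endpoint with image input enabled through `--mmproj`. `build_payload`, `ask`, and the file name `invoice.png` are hypothetical names introduced here; sampling parameters are left to the values passed on the server command line.

```python
import base64
import json
import urllib.request

# System prompt from the model card, the same text set in the GUI.
SYSTEM_PROMPT = (
    "You are an expert reasoner with extensive experience in all areas. "
    "You approach problems through systematic thinking and rigorous reasoning. "
    "Your response should reflect deep understanding and precise logical thinking, "
    "making your solution path and reasoning clear to others. "
    "Please put your thinking process within <think>...</think> tags."
)


def build_payload(image_bytes, question, mime="image/png"):
    """Build an OpenAI-style chat request with one text part and one image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            },
        ]
    }


def ask(payload, base_url="http://localhost:8080"):
    """POST the payload to the server and return the assistant message text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read())
    return body["choices"][0]["message"]["content"]


def main():
    # Hypothetical image file; replace with a real path before running.
    with open("invoice.png", "rb") as f:
        print(ask(build_payload(f.read(), "Describe this image in detail.")))
```

Calling `main()` with the server running should print the model's answer. Depending on server settings, the thinking part may appear inside the content or be returned separately.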

With f16 and full GPU offload, it looks like this.

Output

prompt eval time =     214.15 ms /   336 tokens (    0.64 ms per token,  1569.00 tokens per second)
       eval time =   17240.10 ms /   932 tokens (   18.50 ms per token,    54.06 tokens per second)
      total time =   17454.25 ms /  1268 tokens
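
The throughput figures in this log can be recomputed from the raw durations and token counts. This is a small sketch that parses only the two timing lines shown above; the regular expression and the helper name `tokens_per_second` are assumptions made for this illustration, not part of llama.cpp.

```python
import re

LOG = """\
prompt eval time =     214.15 ms /   336 tokens (    0.64 ms per token,  1569.00 tokens per second)
       eval time =   17240.10 ms /   932 tokens (   18.50 ms per token,    54.06 tokens per second)
"""

# Matches "<label> time = <milliseconds> ms / <count> tokens" at the start of a line.
PATTERN = re.compile(
    r"^\s*(prompt eval|eval) time =\s*([\d.]+) ms /\s*(\d+) tokens", re.MULTILINE
)


def tokens_per_second(log):
    """Map each timing label to tokens divided by elapsed seconds."""
    return {
        label: int(count) / (float(ms) / 1000.0)
        for label, ms, count in PATTERN.findall(log)
    }


rates = tokens_per_second(LOG)
for label, rate in rates.items():
    print(f"{label}: {rate:.2f} tokens/s")
```

The recomputed values agree with the reported ones to within rounding of the printed milliseconds: roughly 1569 tokens/s for prompt processing and about 54 tokens/s for generation.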

Output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:01:00.0 Off |                  Off |
|  0%   48C    P8             11W /  450W |   16742MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
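
The 16742MiB shown above is roughly what f16 weights alone would predict. The sketch below is a back-of-the-envelope estimate under stated assumptions: it uses the nominal parameter counts from the model announcement (8B language model, 0.3B vision encoder) and 2 bytes per f16 parameter. The real parameter counts differ somewhat, and the remainder also covers the KV cache and runtime buffers, so treat the result as an order-of-magnitude check only.

```python
MIB = 1024 ** 2
BYTES_PER_F16_PARAM = 2


def f16_weight_mib(n_params):
    """Approximate size in MiB of n_params weights stored as f16."""
    return n_params * BYTES_PER_F16_PARAM / MIB


# Nominal parameter counts from the announcement (assumed, not exact).
llm_mib = f16_weight_mib(8e9)
vit_mib = f16_weight_mib(0.3e9)
weights_mib = llm_mib + vit_mib

# Values read from the nvidia-smi output above.
used_mib = 16742
total_mib = 24564

print(f"estimated weights: {weights_mib:.0f} MiB")
print(f"left for KV cache and buffers: {used_mib - weights_mib:.0f} MiB")
print(f"VRAM in use: {used_mib / total_mib:.1%}")
```

Under these assumptions the weights account for most of the reported usage, which is consistent with the f16 model fitting on a 24GB RTX 4090 with some headroom.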

kun432

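 Summary
Japanese works fine, and recognition accuracy for Japanese text also seems fairly high. In these respects it is the same as InternVL3, but it also seems to have inherited the apparent lack of knowledge about Japan.
As the model card says,


 Features
Strong performance across language and vision reasoning benchmarks, with particular strength in scientific tasks.
Continually pretrained on a large-scale dataset of 5 trillion tokens, more than 50% of which is specialized scientific data, embedding deep domain knowledge.
A dynamic tokenizer enables native understanding of molecular formulas and protein sequences.

these appear to be its strengths, so what I tried here may not amount to a fair evaluation. It needs to be verified against your own use case.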
