Trying out Qwen3

https://x.com/Alibaba_Qwen/status/1916962087676612998
Introducing Qwen3!
We are releasing Qwen3, our latest large language models, as open weights: two MoE models and six dense models ranging from 0.6B to 235B. The flagship, Qwen3-235B-A22B, achieves competitive results on benchmarks for coding, math, and general capabilities compared with other top-tier models such as DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. In addition, the small MoE model Qwen3-30B-A3B outperforms QwQ-32B, which has 10 times as many activated parameters, and even a model as small as Qwen3-4B rivals the performance of Qwen2.5-72B-Instruct.
For more information, try it in Qwen Chat Web (https://chat.qwen.ai) and the app, and check out our GitHub, HF, ModelScope, and so on.
Blog: https://qwenlm.github.io/blog/qwen3/

GitHub: https://github.com/QwenLM/Qwen3

Hugging Face: https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f

ModelScope: https://modelscope.cn/collections/Qwen3-9743180bdc6b48
Post-trained models such as Qwen3-30B-A3B, along with their pre-trained counterparts (for example, Qwen3-30B-A3B-Base), are available on platforms such as Hugging Face, ModelScope, and Kaggle. For deployment, we recommend frameworks such as SGLang and vLLM. For local use, we highly recommend tools such as Ollama, LMStudio, MLX, llama.cpp, and KTransformers. With these options, users can easily integrate Qwen3 into their workflows, whether in research, development, or production.
Enjoy the new models!
https://x.com/Alibaba_Qwen/status/1916962091925442698
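Qwen3 exhibits scalable and smooth performance improvements that correlate directly with the computational reasoning budget allocated. This design makes it easier for users to set task-specific budgets and strike a better balance between cost efficiency and inference quality.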
https://x.com/Alibaba_Qwen/status/1916962096346202468
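Qwen3 models support 119 languages and dialects. This broad multilingual capability opens up new possibilities for international applications and lets users around the world benefit from the power of these models.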
https://x.com/Alibaba_Qwen/status/1916962100817367192
We have optimized the Qwen3 models for coding and agentic capabilities, and we have also strengthened support for MCP. Below are examples of how Qwen3 thinks and interacts with its environment.
https://x.com/Alibaba_Qwen/status/1917064282552078480
We also ran a preliminary evaluation of Qwen3-235B-A22B on the open-source coding agent OpenHands. It scored 34.4% on SWE-bench Verified, a competitive result with fewer parameters! Thanks to @allhands_ai for providing an easy-to-use agent. Open models and open agents together are exciting!

kun432

At this point, npaka's summary is the easier read:

https://note.com/npaka/n/n43abd5843fe7

kun432

Quantized models too
https://x.com/Alibaba_Qwen/status/1918353505074725363
We will be releasing quantized models of Qwen3 over the coming days. Today we are publishing the AWQ and GGUF versions of Qwen3-14B and Qwen3-32B, which makes the models usable with limited GPU memory.
Qwen3-32B-AWQ: https://huggingface.co/Qwen/Qwen3-32B-AWQ

Qwen3-32B-GGUF: https://huggingface.co/Qwen/Qwen3-32B-GGUF

Qwen3-14B-AWQ: https://huggingface.co/Qwen/Qwen3-14B-AWQ

Qwen3-14B-GGUF: https://huggingface.co/Qwen/Qwen3-14B-GGUF
When using the GGUFs in Ollama and LMStudio, you can switch from thinking mode to non-thinking mode simply by adding the special token /no_think to the end of your input. The original post includes an example.
Enjoy!

kun432

There are a lot of models
https://huggingface.co/Qwen/Qwen3-0.6B
https://huggingface.co/Qwen/Qwen3-1.7B
https://huggingface.co/Qwen/Qwen3-4B
https://huggingface.co/Qwen/Qwen3-8B
https://huggingface.co/Qwen/Qwen3-14B
https://huggingface.co/Qwen/Qwen3-32B
The following are the MoE models
https://huggingface.co/Qwen/Qwen3-30B-A3B
https://huggingface.co/Qwen/Qwen3-235B-A22B

kun432

Unsloth has published versions too

kun432

There are many to choose from, so I'll try 8B, which seems practical for everyday use. It looks like it will just barely fit on a Colaboratory T4.

The transformers version must be >=4.51.0.

!pip install -U transformers
!pip freeze | grep -i transformers

Output

transformers==4.51.3
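
To guard against an older install, the same requirement can also be checked in code. This is only a minimal sketch: meets_minimum is a hypothetical helper (not part of transformers), and it compares just the leading numeric parts of the version string.

```python
from importlib import metadata


def meets_minimum(installed: str, minimum: str = "4.51.0") -> bool:
    """Compare dotted version strings by their leading numeric components."""
    def parse(v: str) -> tuple:
        parts = []
        for piece in v.split("."):
            if not piece.isdigit():
                break  # stop at suffixes such as "dev0" or "rc1"
            parts.append(int(piece))
        return tuple(parts)

    return parse(installed) >= parse(minimum)


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
        status = "OK" if meets_minimum(installed) else "too old for Qwen3"
        print(installed, status)
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
```

With the version shown above, meets_minimum("4.51.3") is true, while something like "4.50.3" would be rejected.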

Loading the model and tokenizer

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

VRAM usage right after loading the model is about 12.3GB (12,332 MiB)

Output

Sat May  3 01:26:10 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   46C    P0             26W /   70W |   12332MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
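
As a rough cross-check on that number, the footprint of the weights alone can be estimated from the parameter count. This is a back-of-the-envelope sketch: the 8.2B parameter count is an assumption (recalled from the model card, so treat it as approximate), 2 bytes per weight assumes a 16-bit dtype, and activations, the KV cache, and framework overhead are ignored.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


ASSUMED_PARAMS = 8.2e9   # assumption: approximate parameter count of Qwen3-8B
T4_TOTAL_MIB = 15360     # total memory reported by nvidia-smi above

estimate = weight_memory_gib(ASSUMED_PARAMS)
print(f"estimated weights: {estimate:.1f} GiB, T4 total: {T4_TOTAL_MIB / 1024:.1f} GiB")
```

If the estimate is larger than what the GPU reports as used, device_map="auto" may have placed some layers outside the GPU, which would be one possible reason for slow generation. That was not verified here; with the model loaded, printing model.hf_device_map should show where each module ended up.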

Inference. enable_thinking toggles thinking mode on or off; it is on by default.

# Input to the model
prompt = "大規模言語モデルについて簡単に説明してください。"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # switches between thinking and non-thinking mode; default is True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Run text generation
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Split the thinking content from the final answer
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
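
The parsing step at the end of the snippet can be tried in isolation, without loading the model. The sketch below wraps the same reverse-index logic in a hypothetical helper, split_thinking, and runs it on toy token ids; 151668 is the </think> token id used in the snippet above.

```python
END_THINK_ID = 151668  # token id of </think>, as in the snippet above


def split_thinking(output_ids: list[int], end_think_id: int = END_THINK_ID):
    """Return (thinking_ids, answer_ids), splitting right after the last </think>."""
    try:
        # Search the reversed list so that the last occurrence is found.
        index = len(output_ids) - output_ids[::-1].index(end_think_id)
    except ValueError:
        # No </think> in the output: treat everything as the answer.
        index = 0
    return output_ids[:index], output_ids[index:]


thinking, answer = split_thinking([1, 2, END_THINK_ID, 3, 4])
print(thinking, answer)
```

When no </think> token is present, the thinking part is empty and the whole output is treated as the answer.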

It took about 11 minutes to generate the answer

Output

thinking content: <think>
Okay, the user wants a simple explanation of large language models. Let me start by defining what they are. They're AI systems trained on vast amounts of text data. I should mention that they can understand and generate human-like text. Maybe give examples like chatbots or translation tools. I need to explain the training process briefly—using huge datasets and deep learning. Also, highlight their capabilities: answering questions, creating content, coding. But I should keep it easy to understand, avoiding jargon. Maybe mention applications in various fields. Wait, should I touch on the technology behind them, like neural networks? Maybe just a little. Also, note that they have limitations, like potential biases or inability to handle complex tasks. Keep the explanation concise but comprehensive. Let me structure it step by step: definition, training process, capabilities, applications, and limitations. Make sure it's clear and not too technical. Alright, that should cover it.
</think>
content: 大規模言語モデル（Large Language Model、LLM）は、膨大な文章データを学習して、自然な言語（日本語、英語など）を理解し、生成できる人工知能（AI）です。以下に簡単に説明します。

### 1. **特徴**
- **多言語対応**：多くのモデルは複数の言語を扱えます（例：日本語、英語、中国語など）。
- **上下文理解**：文章の前後の文脈を考慮して、意味を正確に把握します。
- **生成能力**：質問への回答、文章の作成、コードの生成、翻訳など、さまざまなタスクに応じた文章を生成できます。

### 2. **学習方法**
- **大量のデータ**：インターネット上の文章、書籍、論文など、膨大なテキストデータを学習します。
- **深層学習**：ニューラルネットワーク（特にTransformerアーキテクチャ）を用いて、言語のパターンや構造を学習します。

### 3. **用途例**
- **チャットボット**：ユーザーと自然な会話をできるAI。
- **質問応答**：特定の分野（科学、歴史、技術など）の知識を活用した回答。
- **文章生成**：ブログ記事、メール、小説などの作成支援。
- **翻訳**：複数の言語間での翻訳。
- **プログラミング**：コードの作成や修正のサポート。

### 4. **限界**
- **バイアスの可能性**：学習データに含まれる偏見や誤情報が反映されることがあります。
- **論理的推論の限界**：複雑な論理的思考や現実の知識を完全に理解できない場合があります。
- **計算リソース**：大規模なモデルは高性能なコンピュータやクラウドサービスが必要です。

### 5. **代表的なモデル**
- **GPTシリーズ**（OpenAI）
- **BERT**（Google）
- **LLaMA**（Meta）
- **Qwen**（通義千問）など

大規模言語モデルは、AI技術の進歩により、日常生活やビジネス、教育などさまざまな分野で活用されていますが、適切な利用と倫理的な配慮が求められます。

VRAM usage after inference is about 14.7GB (14,758 MiB of 15,360 MiB). It barely fits.

Output

Sat May  3 01:51:57 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   72C    P0             41W /   70W |   14758MiB /  15360MiB |     39%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

kun432

As the model card also notes, enable_thinking is a hard switch. Even when it is enabled, thinking can still be turned off from the prompt by adding /no_think to the system prompt or the user prompt.

prompt = "大規模言語モデルについて簡単に説明してください。"
messages = [
    {"role": "user", "content": prompt + " /no_think"}     # append /no_think
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)

It took about 4 minutes to generate the answer. The <think> block is empty, which shows that thinking was disabled.

Output

thinking content: <think>

</think>
content: 大規模言語モデル（Large Language Model、LLM）は、膨大な量のテキストデータを学習することで、自然言語を理解し、生成する能力を持つ人工知能の一種です。これらのモデルは、文章の意味を理解し、質問に答える、文章を翻訳する、文章を要約する、新しい文章を生成するなど、さまざまなタスクに応じて活用されます。

主な特徴には以下のものがあります：

- **大規模なパラメータ数**：数億から数千億ものパラメータを持つことで、複雑なパターンを学習できます。
- **多言語対応**：多くのモデルは複数の言語をサポートしており、国際的な利用が可能です。
- **生成能力**：ユーザーの指示に応じて、文章やコード、詩、ストーリーなどを生成できます。
- **応用範囲が広い**：チャットボット、翻訳、要約、質問応答、プログラミング支援など、さまざまな分野で利用されています。

代表的な大規模言語モデルには、GPT（OpenAI）、BERT（Google）、LLaMA（Meta）などがあります。

VRAM usage is also lower than in thinking mode (13,668 MiB versus 14,758 MiB)

Output

Sat May  3 02:10:53 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   72C    P0             40W /   70W |   13668MiB /  15360MiB |     39%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

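I also tried explicitly disabling thinking mode with enable_thinking=False, but neither generation time nor VRAM usage went down. (I ran too few trials for this to mean much, but in my runs they sometimes went up instead.)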

In that case, controlling it from the prompt seems more convenient.
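
Controlling the mode from the prompt could be wrapped in a small helper so that the switch is applied per request. This is a sketch only: with_soft_switch is a hypothetical name, and the /think tag is taken from the model card's description of the soft switch rather than from anything tested in this scrap.

```python
def with_soft_switch(messages: list[dict], thinking: bool) -> list[dict]:
    """Return a copy of messages with /think or /no_think appended to the last user turn."""
    tag = "/think" if thinking else "/no_think"
    result = [dict(m) for m in messages]  # copy each message so the input is not modified
    for message in reversed(result):
        if message["role"] == "user":
            message["content"] = f'{message["content"]} {tag}'
            break
    return result


messages = [{"role": "user", "content": "Explain large language models briefly."}]
print(with_soft_switch(messages, thinking=False))
```

The returned list can be passed to tokenizer.apply_chat_template in place of messages, leaving enable_thinking at its default.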

kun432

Would a quantized version be a bit faster? Also, for running locally, something around 4B might be good enough.

Also, the context size is 32,768 tokens, but models from 4B upward support up to 131,072 tokens with RoPE scaling.
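
The two numbers are related by the scaling factor. The sketch below assumes a rope_scaling entry in the shape the Qwen3 model cards describe for YaRN (the keys and the factor of 4.0 are recalled from the model card, not verified in this scrap) and just does the arithmetic.

```python
# Assumed rope_scaling entry; keys and values are recalled from the model card.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}


def extended_context(cfg: dict) -> int:
    """Context length after scaling: native length multiplied by the factor."""
    return int(cfg["original_max_position_embeddings"] * cfg["factor"])


print(extended_context(rope_scaling))
```

A factor of 4.0 on the native 32,768 tokens gives the 131,072 figure.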

kun432

For now I also tried it briefly with mlx-lm, using the 4-bit quantized version of 8B.

uv init -p 3.12.9 mlx-lm-work && cd mlx-lm-work
uv add mlx-lm
uv run mlx_lm.generate --model "mlx-community/Qwen3-8B-4bit" --max-tokens 8192 --prompt "競馬の魅力を5つリストアップして。"

Output

Fetching 9 files: 100%|████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 75047.19it/s]
==========
<think>
Okay, the user wants me to list five reasons why horse racing is appealing. Let me start by recalling what I know about horse racing. First, the excitement of the race itself. The unpredictability of the outcome is a big part of it. Then there's the aspect of strategy and training. Horse racing isn't just about luck; it's also about the skill of the jockey and the trainer.

Wait, maybe I should think about the emotional connection. People often form bonds with the horses, which adds a personal touch. Also, the history and tradition of horse racing. It's been around for centuries, so there's a rich cultural aspect.

Oh, and the entertainment value. The atmosphere at the races, the crowd, the betting aspect. That could be another point. Let me check if I have five. Excitement, strategy, emotional connection, tradition, and entertainment. That seems to cover it. I should make sure each point is distinct and highlights different aspects of the sport. Maybe also mention the community aspect or the thrill of the competition. Yeah, that should work.
</think>

競馬の魅力を以下のように5つリストアップできます：

1. **予測不可能な競争の熱狂**
   順位が事前に決まらないため、どの馬が勝つか最後まで分からない。その不確実性が緊張感とドキドキを生み、観戦の醍醐味を高めます。

2. **人馬一体のドラマ**
   馬と騎手の絆、トレーナーの努力が結果に直結。馬の性格や走り方、騎手の判断が競争をより人間味のあるドラマへと変える点が魅力です。

3. **歴史と文化的な深さ**
   世界中で長年にわたって続く伝統的な競技。各地の文化や風習に根ざした独自の魅力があり、歴史を知る楽しみもあります。

4.ドラマチックな瞬間の美しさ
   距離の差が縮まって最後の直線で競争が一変する瞬間、または馬の走りの美しさは、観戦者を魅了します。

5. **賭け事としての刺激と戦略性**
   ポイントを取るための戦略や、馬の成績・環境の分析が求められ、競馬を楽しむための知的な側面も存在します。
==========
Prompt: 19 tokens, 29.444 tokens-per-sec
Generation: 539 tokens, 38.091 tokens-per-sec
Peak memory: 4.784 GB

The generation speed looks good, so this should be perfectly usable day to day.
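
To put the throughput figures in wall-clock terms, the summary lines printed by mlx_lm.generate can be converted to seconds. This is a small sketch: phase_seconds is a hypothetical helper, and the regular expression assumes the exact line format shown in the output above.

```python
import re

# Summary lines copied from the output above.
STATS = """Prompt: 19 tokens, 29.444 tokens-per-sec
Generation: 539 tokens, 38.091 tokens-per-sec
Peak memory: 4.784 GB"""


def phase_seconds(stats: str) -> dict:
    """Map each phase name to tokens divided by tokens-per-sec."""
    pattern = re.compile(
        r"^(Prompt|Generation): (\d+) tokens, ([\d.]+) tokens-per-sec$", re.M
    )
    return {
        name: int(tokens) / float(rate)
        for name, tokens, rate in pattern.findall(stats)
    }


for phase, seconds in phase_seconds(STATS).items():
    print(f"{phase}: {seconds:.2f} s")
```

For this run that works out to roughly 14 seconds of generation, compared with minutes for the unquantized model on the T4.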
