Trying Qwen3 (30B-A3B-Instruct-2507, 30B-A3B-Thinking-2507, Coder-30B-A3B-Instruct) with llama.cpp (Unsloth GGUF)

https://x.com/UnslothAI/status/1950935044387684478
Models
Instruct (no thinking)

https://huggingface.co/unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF
Thinking

https://huggingface.co/unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF
Coder

https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
Unsloth documentation

https://docs.unsloth.ai/basics/qwen3-2507
https://docs.unsloth.ai/basics/qwen3-coder-how-to-run-locally

kun432

The environment is Ubuntu-22.04 / CUDA-12.4 / RTX4090.

Build the latest llama.cpp.

git clone https://github.com/ggml-org/llama.cpp llama.cpp && cd $_
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 16

kun432

Qwen3-30B-A3B-Instruct-2507-GGUF

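According to the documentation, the recommended parameter values are as follows.

⚙️ Best Practices

To achieve optimal performance, Qwen recommends these settings:

Use temperature=0.7, top_p=0.8, top_k=20, and min_p=0.0. If the framework supports presence_penalty, set it between 0 and 2 to avoid endless repetition.

temperature = 0.7

top_k = 20

min_p = 0.00 (llama.cpp's default is 0.1)

top_p = 0.80

presence_penalty = 0.0 to 2.0 (off by default in llama.cpp, but you can use it to reduce repetition; try 1.0, for example)

The model natively supports a context of up to 262,144 tokens, but you can set it to 32,768 tokens to reduce RAM usage.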
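
To get a feel for what these four knobs do, here is a minimal Python sketch of how temperature, top_k, top_p, and min_p narrow the candidate tokens. It is only an illustration: the function name `filter_candidates` and the toy logits are made up for this note, and the order of the steps is an assumption, not necessarily the order llama.cpp's sampler chain uses.

```python
import math


def filter_candidates(logits, temperature=0.7, top_k=20, top_p=0.8, min_p=0.0):
    """Return {token: probability} for the tokens that survive all filters.

    Illustrative order only (assumption): temperature -> top-k -> top-p -> min-p.
    """
    # Temperature-scaled softmax over every token (temperature must be > 0).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda item: item[1],
        reverse=True,
    )

    # top-k: keep only the k most likely tokens.
    ranked = ranked[:top_k]

    # top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # min-p: drop tokens whose probability is below min_p * (best probability).
    best = kept[0][1]
    kept = [(tok, prob) for tok, prob in kept if prob >= min_p * best]

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}  # made-up values
print(filter_candidates(toy_logits))
```

With min_p=0.0 the last filter does nothing, which is why the docs set it explicitly instead of relying on the default.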

For now I'll try it without worrying about the chat template. The quantization is UD-Q4_K_XL.

./build/bin/llama-cli \
    -hf unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_XL \
    --jinja \
    -ngl 99 \
    --threads -1 \
    --ctx-size 32684 \
    --temp 0.7 \
    --min-p 0.0 \
    --top-p 0.80 \
    --top-k 20 \
    --presence-penalty 1.0

It seems to fit within 24 GB of VRAM.

Output

Fri Aug  1 17:04:51 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
|  0%   49C    P0             51W /  450W |   22533MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
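
As a rough aid for picking a context size, the KV cache can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a measurement: the architecture numbers (48 layers, 4 KV heads, head dimension 128) and the f16 cache are assumptions that should be checked against the model's config, and the helper name `kv_cache_bytes` is made up for this note. It ignores the model weights and compute buffers entirely.

```python
def kv_cache_bytes(n_ctx, n_layers=48, n_kv_heads=4, head_dim=128, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a given context length.

    All default architecture values are assumptions (not taken from this scrap).
    The leading 2 accounts for one K and one V tensor per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


for n_ctx in (32_768, 262_144):
    gib = kv_cache_bytes(n_ctx) / 2**30
    print(f"{n_ctx} tokens -> {gib:.1f} GiB")
```

Under these assumptions the cache is about 3 GiB at 32,768 tokens and about 24 GiB at 262,144 tokens, so the full native context would not fit next to the weights on a 24 GB card.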

Inference (the prompt asks for a list of five ways to enjoy horse racing)

> 競馬の楽しみ方について5つリストアップして。

Output (model reply, in Japanese)

もちろん！競馬の楽しみ方は人それぞれですが、以下に5つの代表的な楽しみ方をリストアップしました：

1. **レースの予想・分析**
　過去の成績、馬の調教状態、騎手の実力、コースの特性などを調べて、どの馬が勝つのかを予想する。データや直感を駆使して予想を立てる過程自体が大きな楽しみです。

2. **勝ち馬の“見逃し”を防ぐ瞬間**
　レースの最終コーナーで一瞬の展開が決まる瞬間、期待と緊張の中で「この馬が抜け出す！」という瞬間を体験する醍醐味。特に人気馬が逆転勝利するときの驚きは特別です。

3. **馬や騎手との“つながり”を感じる**
　好きな馬や騎手を応援し、彼らの活躍を追うことで、感情移入しながらレースを楽しむ。特に「自分の選んだ馬が勝った！」という達成感は大きいです。

4. **競馬場での雰囲気・イベント体験**
　競馬場の観客席での声援、開催中のステージイベント、グルメやお土産など、レース以外の楽しみも豊富。家族や友人と一緒に楽しむのも一興です。

5. **ギャンブルとしてのリスクとリターンの面白さ**
　賭け（馬券）をすることで、勝敗の結果に直接お金が関わるため、緊張感や喜び・悔しさがより強く感じられます。ただし、適度な範囲で楽しむことが大切です。

これらの楽しみ方を組み合わせて、自分だけの競馬ライフを築くのもおすすめです！

Output

llama_perf_sampler_print:    sampling time =     209.75 ms /   427 runs   (    0.49 ms per token,  2035.78 tokens per second)
llama_perf_context_print:        load time =    1944.88 ms
llama_perf_context_print: prompt eval time =     130.78 ms /    20 tokens (    6.54 ms per token,   152.93 tokens per second)
llama_perf_context_print:        eval time =    2501.15 ms /   406 runs   (    6.16 ms per token,   162.33 tokens per second)
llama_perf_context_print:       total time =  109593.11 ms /   426 tokens
llama_perf_context_print:    graphs reused =          0
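
To compare runs without reading the numbers by eye, the timing lines can be parsed with a small script. This is a minimal sketch that assumes the log format shown above; the names `PERF_LINE` and `parse_perf` are made up for this note, and lines without a count (such as the load time) are simply skipped.

```python
import re

# Matches e.g. "eval time =    2501.15 ms /   406 runs" (format assumed from the log above).
PERF_LINE = re.compile(
    r"(?P<name>[\w ]+?) time =\s*(?P<ms>[\d.]+) ms /\s*(?P<n>\d+) (?:runs|tokens)"
)


def parse_perf(log):
    """Map each timing name (e.g. 'eval') to (milliseconds, count, count per second)."""
    stats = {}
    for line in log.splitlines():
        # Drop the "llama_perf_..._print:" prefix before matching.
        _, _, rest = line.partition(":")
        m = PERF_LINE.search(rest)
        if m:
            ms, n = float(m["ms"]), int(m["n"])
            stats[m["name"].strip()] = (ms, n, n / (ms / 1000.0))
    return stats


sample = """\
llama_perf_context_print: prompt eval time =     130.78 ms /    20 tokens (    6.54 ms per token,   152.93 tokens per second)
llama_perf_context_print:        eval time =    2501.15 ms /   406 runs   (    6.16 ms per token,   162.33 tokens per second)
"""
for name, (ms, n, per_sec) in parse_perf(sample).items():
    print(f"{name}: {per_sec:.2f} per second")
```

The "eval" entry is the generation speed. The "total" time is wall-clock time for the whole session, so it is not useful for comparing throughput.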

kun432

Qwen3-30B-A3B-Thinking-2507

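The recommended parameter values for Thinking are as follows. Compared with Instruct, temperature (0.6 vs. 0.7) and top_p (0.95 vs. 0.8) differ slightly.

⚙️ Best Practices

To achieve optimal performance, Qwen recommends these settings:

Use temperature=0.6, top_p=0.95, top_k=20, and min_p=0.0. If the framework supports presence_penalty, set it between 0 and 2 to avoid endless repetition.

temperature = 0.6

top_k = 20

min_p = 0.00 (llama.cpp's default is 0.1)

top_p = 0.95

presence_penalty = 0.0 to 2.0 (off by default in llama.cpp, but you can use it to reduce repetition; try 1.0, for example)

The model natively supports a context of up to 262,144 tokens, but you can set it to 32,768 tokens to reduce RAM usage.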

./build/bin/llama-cli \
    -hf unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF:Q4_K_XL \
    --jinja \
    -ngl 99 \
    --threads -1 \
    --ctx-size 32684 \
    --temp 0.6 \
    --min-p 0.0 \
    --top-p 0.95 \
    --top-k 20 \
    --presence-penalty 1.0

> 競馬の楽しみ方について5つリストアップして。

Output

<think>
Okay, the user is asking for a list of five ways to enjoy horse racing. Let me think about this. First, I need to recall common aspects of horse racing that people find enjoyable. Maybe start with betting since that's a big part of it.

Wait, but not everyone bets, so maybe include both betting and non-betting activities. Let me brainstorm. Watching races is obvious. Then maybe studying the horses and jockeys. Oh, there's also the social aspect—going to the track with friends.

Hmm, what else? Maybe learning about the history or the breeding of horses. Or following specific horses over time. Wait, the user wants five points. Let me check if I have enough.

1. Betting on races – but need to mention responsible gambling.
2. Watching live races at the track or on TV.
3. Studying horse profiles, jockeys, and past performances.
4. Socializing with other fans at racecourses.
5. Following a particular horse's career over multiple races.

Wait, is that five? Let me count. Yeah, those are five. But maybe I should make sure each point is distinct. Let me verify. Betting, watching, studying, socializing, following a horse. Yep, that's five.

Wait, but some might argue that studying and following a horse are similar. Maybe combine them? No, studying could be about the data, while following a horse is more emotional investment. So they're different.

Also, need to make sure the list is in Japanese. The user asked for the answer in Japanese, so I should present the list in Japanese. But the thinking is in English. Let me confirm each point.

Another thought: maybe include attending events at the racetrack, like races with food and entertainment. But that might overlap with socializing. Hmm. Maybe the fifth point can be about enjoying the atmosphere and events at the track, not just the racing itself.

Wait, the user wants five points. Let me stick to the initial list. Let's check online if there are common ways people enjoy horse racing. From what I know, betting, watching live, analyzing, socializing, and building a connection with a horse are common. So that's probably accurate.

Make sure each point is clear and concise. Avoid technical terms so it's accessible. Also, mention responsible gambling for the betting point to be safe.

Okay, putting it all together in Japanese. Let me translate each point.

1. ベッティング（賭け）を楽しみながら、責任あるギャンブルを心がける
2. レースの実況や映像を観戦し、瞬間の緊張感を体感する
3. 馬や騎手のデータ、過去の成績を分析して戦略を立てて楽しむ
4. 賭博場で友人や家族と交流を深め、雰囲気を楽しむ
5. 特定の馬の成長や成績を追いかけて、長期的な関係を築く

Wait, "賭博場" is a bit off. It should be "競馬場" (racecourse). So correct that.

Also, for point 3, maybe "過去の成績を分析して戦略を立てて楽しむ" – but some people might not bet, so maybe say "分析して知識を深め、楽しみ方の一つとして" instead of "戦略を立てて".

Hmm. Let me adjust that. Maybe "馬や騎手のデータ、過去の成績を学び、競馬の奥深さを楽しむ"

Yes, that's better. So point 3 is about learning and appreciating the depth.

Point 4: "競馬場で友人や家族と交流を深め、雰囲気を楽しむ" – yes, that's good.

Point 5: Following a specific horse's career.

Double-checking for five points. Yes. Now, need to present them in Japanese with numbers. Make sure the terms are correct.

Also, avoid any negative connotations. Since betting can be problematic, mentioning responsible gambling is good.

Okay, final list should be:

1. ベッティング（賭け）を楽しみつつ、責任あるギャンブルを心がける
2. レースの実況や映像を観戦し、瞬間の緊張感を体感する
3. 馬や騎手のデータ、過去の成績を分析して競馬の奥深さを楽しむ
4. 競馬場で友人や家族と交流を深め、会場の雰囲気を楽しむ
5. 特定の馬の成長や成績を追いかけて、長期的な関わりを楽しむ

Wait, "実況や映像" – maybe "実況中継やライブ映像" but "実況" is short for "実況中継". Maybe just "実況や映像" is okay.

Alternatively, "テレビやネットでの実況観戦" but the user asked for five points, so keep it concise.

I think this list works. Let me check if each point is unique and covers different aspects. Yes: betting (with responsibility), watching, analysis, social, following a horse. Good.

Now, write them in Japanese as a numbered list.
</think>

1. ベッティング（賭け）を楽しみつつ、責任あるギャンブルを心がける
2. レースの実況やライブ映像を観戦し、瞬間の緊張感と躍動感を体感する
3. 馬や騎手のデータ、過去の成績を分析して競馬の戦略や深みを楽しむ
4. 競馬場で友人や家族と交流を深め、会場の雰囲気やイベントを存分に楽しむ
5. 特定の馬の成長や成績を追いかけて、長期的な関わりを通じて感情移入して楽しむ
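
Since the Thinking model's reply carries its reasoning inside `<think>...</think>` before the final answer, any script that consumes the output will usually want to separate the two. Here is a minimal sketch, assuming both tags are present as in the output above; the names `THINK_BLOCK` and `split_reasoning` are made up for this note.

```python
import re

# Assumes a single, well-formed <think>...</think> block, as in the output above.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer); reasoning is empty if no think block is found."""
    m = THINK_BLOCK.search(text)
    if m is None:
        return "", text.strip()
    reasoning = m.group(1).strip()
    # Everything outside the think block is treated as the final answer.
    answer = (text[: m.start()] + text[m.end():]).strip()
    return reasoning, answer


raw = "<think>\nOkay, the user is asking...\n</think>\n\n1. first item\n2. second item"
reasoning, answer = split_reasoning(raw)
print(answer)
```

In the run above the reasoning accounts for most of the 1296 generated tokens, which is why the Thinking model takes noticeably longer than Instruct for the same prompt.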

Output

llama_perf_sampler_print:    sampling time =     715.71 ms /  1317 runs   (    0.54 ms per token,  1840.14 tokens per second)
llama_perf_context_print:        load time =    1894.25 ms
llama_perf_context_print: prompt eval time =     131.75 ms /    20 tokens (    6.59 ms per token,   151.80 tokens per second)
llama_perf_context_print:        eval time =    8394.49 ms /  1296 runs   (    6.48 ms per token,   154.39 tokens per second)
llama_perf_context_print:       total time =  746969.63 ms /  1316 tokens
llama_perf_context_print:    graphs reused =          0

kun432

Qwen3-Coder-30B-A3B-Instruct

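The recommended settings for this one are as follows.

⚙️ Recommended Settings

Qwen recommends the following inference settings for both models:

temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05

Temperature: 0.7

Top_K: 20

Min_P: 0.00 (optional, but 0.01 works well; llama.cpp's default is 0.1)

Top_P: 0.8

Repetition Penalty: 1.05

Recommended context output: 65,536 tokens (can be increased).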
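
The Coder settings use a repetition penalty, while the Instruct and Thinking settings use a presence penalty. The sketch below shows the commonly used formulation of both, as I understand it: the repetition penalty scales the logit of any token that has already appeared, and the presence penalty subtracts a constant from it. The function name `apply_penalties` and the toy values are made up for this note, and this is not a copy of llama.cpp's implementation.

```python
def apply_penalties(logits, history, repeat_penalty=1.0, presence_penalty=0.0):
    """Return a new {token: logit} dict with penalties applied to seen tokens.

    Commonly used formulation (assumption, not taken from llama.cpp's source):
    - repetition penalty: divide positive logits, multiply non-positive ones
    - presence penalty: subtract a constant once, however often the token appeared
    """
    out = dict(logits)
    for tok in set(history):
        if tok not in out:
            continue
        value = out[tok]
        value = value / repeat_penalty if value > 0 else value * repeat_penalty
        out[tok] = value - presence_penalty
    return out


toy_logits = {"a": 2.1, "b": -1.0, "c": 0.5}  # made-up values
print(apply_penalties(toy_logits, history=["a", "b", "a"], repeat_penalty=1.05))
print(apply_penalties(toy_logits, history=["a", "b", "a"], presence_penalty=1.0))
```

A value of 1.05 is a gentle nudge. That seems sensible for code, where repeating identifiers and keywords is normal.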

./build/bin/llama-cli \
    -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_XL \
    --jinja \
    -ngl 99 \
    --threads -1 \
    --ctx-size 32684 \
    --temp 0.7 \
    --min-p 0.0 \
    --top-p 0.80 \
    --top-k 20 \
    --repeat-penalty 1.05
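
The three commands in this scrap differ only in the repository name and a few sampling flags, so they can be generated from one table. This is a minimal sketch that reuses only the flags and values from the commands above; the names `COMMON`, `PRESETS`, and `build_argv` are made up for this note.

```python
import shlex

# Flags shared by all three runs in this scrap.
COMMON = ["--jinja", "-ngl", "99", "--threads", "-1", "--ctx-size", "32684",
          "--min-p", "0.0", "--top-k", "20"]

# Per-model repository and sampling flags, copied from the commands above.
PRESETS = {
    "instruct": ("unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF",
                 ["--temp", "0.7", "--top-p", "0.80", "--presence-penalty", "1.0"]),
    "thinking": ("unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF",
                 ["--temp", "0.6", "--top-p", "0.95", "--presence-penalty", "1.0"]),
    "coder": ("unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF",
              ["--temp", "0.7", "--top-p", "0.80", "--repeat-penalty", "1.05"]),
}


def build_argv(name, quant="Q4_K_XL", binary="./build/bin/llama-cli"):
    """Build the llama-cli argument list for one of the presets."""
    repo, sampling = PRESETS[name]
    return [binary, "-hf", f"{repo}:{quant}", *COMMON, *sampling]


print(shlex.join(build_argv("coder")))
```

The resulting list can be passed to `subprocess.run` as-is, or printed with `shlex.join` and pasted into a shell.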

> ターミナルで動く三目並べを作って。

Output

以下は、ターミナル上で動作する三人のプレイヤーが参加できる三目並べ（ Tic Tac Toe ）ゲームです。

このプログラムでは：

- 3x3 の盤面を使用します。
- プレイヤー1 は 'X'、プレイヤー2 は 'O' を使用します。
- 入力は盤面の番号（1〜9）で行います。例えば、1 は左上のマスです。
- ゲームが終了するまで盤面を表示しながらプレイヤーを交互に処理します。

---

### 🧠 プログラムコード

```python
def print_board(board):
    print("\n")
    for i in range(3):
        print(f" {board[i][0]} | {board[i][1]} | {board[i][2]} ")
        if i < 2:
            print("-----------")
    print("\n")

def check_win(board, player):
    # 行のチェック
    for row in board:
        if all([cell == player for cell in row]):
            return True
    # 列のチェック
    for col in range(3):
        if all([board[row][col] == player for row in range(3)]):
            return True
    # 対角線のチェック
    if board[0][0] == board[1][1] == board[2][2] == player:
        return True
    if board[0][2] == board[1][1] == board[2][0] == player:
        return True
    return False

def is_board_full(board):
    return all(cell != ' ' for row in board for cell in row)

def get_move():
    while True:
        try:
            move = int(input("マスの番号を入力してください (1-9): "))
            if 1 <= move <= 9:
                return move
            else:
                print("無効な入力です。1〜9の間で入力してください。")
        except ValueError:
            print("数字を入力してください。")

def main():
    board = [[' ' for _ in range(3)] for _ in range(3)]
    current_player = 'X'

    print("三目並べを始めます！")
    print("マスの番号は以下の通りです：")
    print(" 1 | 2 | 3 ")
    print("-----------")
    print(" 4 | 5 | 6 ")
    print("-----------")
    print(" 7 | 8 | 9 ")

    while True:
        print_board(board)
        print(f"プレイヤー {current_player} の番です。")

        move = get_move()
        row = (move - 1) // 3
        col = (move - 1) % 3

        if board[row][col] == ' ':
            board[row][col] = current_player
        else:
            print("そのマスは既に埋まっています。もう一度入力してください。")
            continue

        if check_win(board, current_player):
            print_board(board)
            print(f"🎉 プレイヤー {current_player} の勝利です！")
            break

        if is_board_full(board):
            print_board(board)
            print("引き分けです！")
            break

        # プレイヤーを切り替え
        current_player = 'O' if current_player == 'X' else 'X'

if __name__ == "__main__":
    main()
```

---

### 🔧 実行方法

ターミナルで以下のコマンドを実行してください：

```bash
python tictactoe.py
```

（ファイル名は `tictactoe.py` として保存）

---

### 📌 プレイ方法

1. 盤面の番号が表示されます。
2. 番号を入力してマスを選択します（例：1〜9）。
3. 交互に 'X' と 'O' が配置されていきます。
4. 一度に3つ並んだら勝ちです。

---

ご希望があれば、AI対戦モードやゲームの拡張機能（リプレイ機能など）も追加できます！

Output

llama_perf_sampler_print:    sampling time =     402.34 ms /   946 runs   (    0.43 ms per token,  2351.25 tokens per second)
llama_perf_context_print:        load time =    1876.50 ms
llama_perf_context_print: prompt eval time =     133.47 ms /    21 tokens (    6.36 ms per token,   157.34 tokens per second)
llama_perf_context_print:        eval time =    5813.73 ms /   924 runs   (    6.29 ms per token,   158.93 tokens per second)
llama_perf_context_print:       total time =  366513.32 ms /   945 tokens
llama_perf_context_print:    graphs reused =          0

kun432

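Summary

The models are small, so they are much easier to run than 480B-A35B. Japanese seems to work without any problems, so they look promising.
I'd like to try them with Qwen-Code.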

kun432

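The Coder model's GGUF apparently had various problems and did not work well with Qwen-Code, but it seems to have been fixed? (I have not confirmed this yet.)
https://zenn.dev/kun432/scraps/4281edbb8a5e98
All of the Coder model GGUFs appear to have been updated, so it is probably best to download them again.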
