
Qwen3 Inference Speed and Shaberi3 Benchmark Results


2025-05-03 Update

I have conducted the Shaberi3 benchmark for Qwen3-235B-A22B, so I'm adding the results here.

Inference Speed

I have also added the inference speed [tokens/s] for Qwen3-235B-A22B to the Speed section.

| Model | prompt eval rate | eval rate |
|---|---|---|
| Qwen3-235B-A22B:UD-Q3_K_XL_think | 56.08 | 16.60 |

The eval rate is reasonably fast, but the slow prompt eval rate is a concern.

For example, when using it as a coding agent, inputs of several thousand tokens are common, so it's inevitable that there will be a significant wait from the time the prompt is entered until generation starts.
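As a rough illustration of that wait (the 4,000-token prompt size is a hypothetical coding-agent input, not a measurement from this article):

```python
# Rough time-to-first-token estimate from the measured prompt eval rate.
prompt_tokens = 4000            # hypothetical coding-agent input size
prompt_eval_rate = 56.08        # tokens/s, from the table above

wait_seconds = prompt_tokens / prompt_eval_rate
print(f"{wait_seconds:.1f} s until generation starts")  # ≈ 71.3 s
```

In other words, over a minute can pass before the first output token appears.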

That said, it seems useful for the following purposes:

  • Single-turn processing
  • Coding with auto-approval
  • Automated batch processing

Shaberi3 Benchmark Results

Notes

At the time of this update, gemini-2.0-flash-exp, which was originally available 1,500 times a day, had been limited to 500 times a day.

https://x.com/gosrum/status/1918471127351099826

Since it can now only be used 500 times a day, I judged that evaluating with the higher-performance gemini-2.5-flash-preview-04-17 would give a more accurate comparison, so I switched the evaluator to that model.

Here are the results.

| Model | Weighted Mean | ELYZA-tasks-100 | Japanese MT-Bench | Tengu-Bench |
|---|---|---|---|---|
| gemini-2.5-flash-preview-04-17 | 9.2 | 9.3 | 9.7 | 8.8 |
| grok-3-mini-beta | 9.0 | 9.3 | 9.7 | 8.5 |
| DeepSeek-R1 | 8.9 | 9.0 | 9.5 | 8.7 |
| gpt-4.1-2025-04-14 | 8.9 | 9.1 | 9.5 | 8.4 |
| ★Qwen3-235B-A22B:UD-Q3_K_XL_think | 8.8 | 8.9 | 9.6 | 8.5 |
| ★qwen3-235b-a22b:free | 8.8 | 9.0 | 9.1 | 8.5 |
| claude-3-7-sonnet-20250219 | 8.7 | 9.1 | 9.5 | 7.9 |
| DeepSeek-V3-0324 | 8.6 | 8.9 | 9.0 | 8.1 |
| ★Qwen3-235B-A22B:UD-Q3_K_XL_no_think | 8.5 | 8.7 | 9.4 | 7.8 |
| ★Qwen3-32B:UD-Q4_K_XL_think | 8.4 | 8.6 | 9.5 | 7.6 |
| DeepSeek-R1-UD-IQ1_S | 8.2 | 8.5 | 8.7 | 7.8 |
| ★Qwen3-30B-A3B:UD-Q4_K_XL_think | 8.0 | 7.6 | 9.3 | 7.6 |

Notable Mentions

Qwen3-235B-A22B

To confirm the impact of quantization, I evaluated both qwen3-235b-a22b:free on OpenRouter and Qwen3-235B-A22B:UD-Q3_K_XL.

As a result, surprisingly, Qwen3-235B-A22B:UD-Q3_K_XL achieved a higher score!

From this fact, the following possibilities can be considered:

  • Unsloth's UD-Q3_K_XL is quantized with almost no degradation.
  • OpenRouter's qwen3-235b-a22b:free is performing some kind of quantization (e.g., Q4_K_M).

In any case, it is noteworthy that Qwen3-235B-A22B:UD-Q3_K_XL_think surpassed DeepSeek-V3-0324 and achieved a score comparable to DeepSeek-R1.

Comparison between Qwen3-235B-A22B:UD-Q3_K_XL and DeepSeek-R1-UD-IQ1_S

Even quantized to 1.58 bit, DeepSeek-R1-UD-IQ1_S could not run in under 128GB of VRAM, whereas Qwen3-235B-A22B:UD-Q3_K_XL_think runs in about 120GB of VRAM.

Furthermore, it achieved a higher score than DeepSeek-R1 quantized to 1.58bit, which demonstrates the high performance of Qwen3-235B-A22B:UD-Q3_K_XL.

Impact of reasoning

Even for Qwen3-235B-A22B:UD-Q3_K_XL, the score with reasoning is clearly higher than without.

That said, even without reasoning, the score is higher than Qwen3-32B:UD-Q4_K_XL_think.

Qwen3 tends to "think" for quite a long time, so if you have sufficient VRAM, Qwen3-235B-A22B:UD-Q3_K_XL_no_think might offer a better balance of performance and speed.


Introduction

Who is this article for?

  • Those interested in the Japanese language performance of generative AI
  • Those interested in Qwen3
Environment

  • Mac Studio (M2 Ultra, 128GB)

Overview

I evaluate LLM using the Shaberi3 benchmark as a hobby.

I usually post my results on X, but since Qwen3 has so much to cover, I decided to summarize it in a short article here.

Evaluated Models and Conditions

For no particular reason, this time I chose to run and evaluate the following GGUF models published by Unsloth, using Ollama.

https://huggingface.co/collections/unsloth/qwen3-680edabfb790c8c34a242f95

Parameter Sizes and Quantization Types Evaluated This Time

  • 30B-A3B:UD-Q4_K_XL
  • 32B:UD-Q4_K_XL
  • 14B:UD-Q4_K_XL
  • 8B:UD-Q4_K_XL
  • 4B:UD-Q4_K_XL
  • 1.7B:UD-Q4_K_XL
  • 0.6B:UD-Q4_K_XL

Other Special Notes

  • Temperature: 0.6
  • Reasoning
    • Evaluated both with reasoning (/think) and without (/no_think).
    • In both cases, <think>...</think> tags were removed.
  • Evaluator: gemini-2.0-flash-exp
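The article doesn't show how the <think>...</think> tags were stripped before scoring; a minimal sketch of one way to do it (the regex approach is my assumption, not the author's actual script):

```python
import re

def strip_think(text: str) -> str:
    # Drop everything between <think> and </think>, including newlines,
    # so only the final answer is sent to the evaluator.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print(strip_think("<think>\nstep 1...\nstep 2...\n</think>\nThe answer is 42."))
```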

Results

Speed

For reference, these are single-run measurements taken with a 389-token input using Ollama. The unit is [tokens/s].

| Model | prompt eval rate | eval rate |
|---|---|---|
| Qwen3-235B-A22B:UD-Q3_K_XL_think | 56.08 | 16.60 |
| Qwen3-30B-A3B:UD-Q4_K_XL_think | 422.21 | 55.66 |
| Qwen3-32B:UD-Q4_K_XL_think | 188.86 | 20.07 |
| Qwen3-14B:UD-Q4_K_XL_think | 426.52 | 40.91 |
| Qwen3-8B:UD-Q4_K_XL_think | 748.22 | 58.30 |
| Qwen3-4B:UD-Q4_K_XL_think | 1177.05 | 79.79 |
| Qwen3-1.7B:UD-Q4_K_XL_think | 2401.57 | 131.22 |
| Qwen3-0.6B:UD-Q4_K_XL_think | 4267.43 | 171.23 |
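For reproducibility: `ollama run --verbose` prints these two rates directly, and Ollama's HTTP API exposes the underlying counters. A sketch of deriving tokens/s from an `/api/generate` response (the field names are Ollama's; the sample durations are illustrative values chosen to match the 235B row above):

```python
def token_rates(resp: dict) -> tuple[float, float]:
    # Ollama reports durations in nanoseconds, so scale by 1e9
    # to get tokens per second.
    prompt_rate = resp["prompt_eval_count"] / resp["prompt_eval_duration"] * 1e9
    eval_rate = resp["eval_count"] / resp["eval_duration"] * 1e9
    return prompt_rate, eval_rate

# Illustrative response fields approximating the Qwen3-235B-A22B row.
sample = {
    "prompt_eval_count": 389,
    "prompt_eval_duration": 6_936_519_000,   # ns: 389 tokens in ~6.94 s
    "eval_count": 500,
    "eval_duration": 30_120_482_000,         # ns: 500 tokens in ~30.12 s
}
prompt_rate, eval_rate = token_rates(sample)
```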

Shaberi3 Benchmark Results

The results are shown below. First, the results for the Qwen3 models alone are shown.

| Model | Weighted Mean | ELYZA-tasks-100 | Japanese MT-Bench | Tengu-Bench |
|---|---|---|---|---|
| 32B:UD-Q4_K_XL_think | 8.7 | 9.0 | 9.4 | 8.2 |
| 14B:UD-Q4_K_XL_think | 8.5 | 8.8 | 9.2 | 7.9 |
| 30B-A3B:UD-Q4_K_XL_think | 8.5 | 8.7 | 9.5 | 7.8 |
| 32B:UD-Q4_K_XL_no_think | 8.4 | 8.6 | 9.5 | 7.8 |
| 8B:UD-Q4_K_XL_think | 8.3 | 8.6 | 9.3 | 7.6 |
| 14B:UD-Q4_K_XL_no_think | 8.2 | 8.3 | 9.3 | 7.6 |
| 30B-A3B:UD-Q4_K_XL_no_think | 8.1 | 8.1 | 9.0 | 7.7 |
| 4B:UD-Q4_K_XL_think | 8.0 | 8.2 | 9.0 | 7.3 |
| 8B:UD-Q4_K_XL_no_think | 8.0 | 8.1 | 9.1 | 7.3 |
| 4B:UD-Q4_K_XL_no_think | 7.2 | 7.3 | 8.2 | 6.7 |
| 1.7B:UD-Q4_K_XL_think | 7.0 | 6.9 | 8.2 | 6.4 |
| 1.7B:UD-Q4_K_XL_no_think | 5.9 | 5.8 | 7.0 | 5.5 |
| 0.6B:UD-Q4_K_XL_think | 4.9 | 5.2 | 5.4 | 4.4 |
| 0.6B:UD-Q4_K_XL_no_think | 4.5 | 4.7 | 4.8 | 4.2 |

Next, the results compared with other models are shown.

| Model | Weighted Mean | ELYZA-tasks-100 | Japanese MT-Bench | Tengu-Bench |
|---|---|---|---|---|
| DeepSeek-V3-0324 | 8.8 | 9.4 | 9.1 | 8.2 |
| DeepSeek-R1-UD-IQ1_S | 8.7 | 8.9 | 9.4 | 8.3 |
| ★Qwen3-32B:UD-Q4_K_XL_think | 8.7 | 9.0 | 9.4 | 8.2 |
| gpt-4.1-mini-2025-04-14 | 8.7 | 9.1 | 9.3 | 8.1 |
| gemini-2.0-flash-001 | 8.6 | 9.0 | 9.4 | 7.8 |
| ★Qwen3-14B:UD-Q4_K_XL_think | 8.5 | 8.8 | 9.2 | 7.9 |
| ★Qwen3-30B-A3B:UD-Q4_K_XL_think | 8.5 | 8.7 | 9.5 | 7.8 |
| ★Qwen3-32B:UD-Q4_K_XL_no_think | 8.4 | 8.6 | 9.5 | 7.8 |
| gemma-3-27b-it-Q8_0.gguf | 8.4 | 8.9 | 9.2 | 7.6 |
| ★Qwen3-8B:UD-Q4_K_XL_think | 8.3 | 8.6 | 9.3 | 7.6 |
| gpt-4o-mini-2024-07-18 | 8.3 | 8.6 | 9.2 | 7.6 |
| phi4:14b-q4_K_M | 8.3 | 8.5 | 9.0 | 7.7 |

Notable Points

Scores are Higher with Reasoning

One characteristic of the evaluator, gemini-2.0-flash-exp, is that scores tend to be higher when there are more output tokens.

Therefore, I try to make the evaluation as fair as possible by removing the <think>...</think> tags. However, even under these fair conditions, it is clear that the scores with reasoning are significantly higher.

In particular, for Qwen3 models with 1.7B parameters or more, the performance boost from reasoning is quite impressive.

Qwen3-32B:UD-Q4_K_XL_think

Looking at Qwen3-32B:UD-Q4_K_XL_think, it demonstrates performance on par with DeepSeek-V3-0324. However, since DeepSeek-V3-0324 is a model without reasoning, this might not be an entirely fair comparison.

Among models with reasoning, it is comparable to DeepSeek-R1-UD-IQ1_S (DeepSeek-R1 quantized to 1.58-bit), which highlights the high performance of Qwen3-32B.

Qwen3-14B:UD-Q4_K_XL_think and Qwen3-30B-A3B:UD-Q4_K_XL_think

Both of these models outperformed gemma-3-27b and showed slightly lower performance than gemini-2.0-flash-001.

| Model | prompt eval rate | eval rate |
|---|---|---|
| Qwen3-14B:UD-Q4_K_XL_think | 426.52 | 40.91 |
| Qwen3-30B-A3B:UD-Q4_K_XL_think | 422.21 | 55.66 |

In terms of speed, Qwen3-30B-A3B appears to be the better choice.

Qwen3-8B:UD-Q4_K_XL_think

Qwen3-8B:UD-Q4_K_XL_think achieved scores similar to gpt-4o-mini-2024-07-18 and phi4:14b-q4_K_M. Although not a perfectly fair comparison due to the use of reasoning, it scores significantly higher than gemma-2-9B.

Qwen3-4B:UD-Q4_K_XL_think

The scores are comparable to gemma-2-9b and qwen2.5:14b.

Qwen3-1.7B:UD-Q4_K_XL_think and Qwen3-0.6B:UD-Q4_K_XL_think

While they might be useful depending on the application, they don't seem to offer significantly higher performance compared to other models.

General Observations on Models Without Reasoning

Performance has improved steadily compared to the same-sized models in the Qwen2.5 series.

On the other hand, the scores are clearly lower compared to when reasoning is enabled.

Since response speed is faster without reasoning, it might be useful when you want to utilize the same model for two different use cases.

Summary

In this article, I summarized the speed and Shaberi3 benchmark results for Qwen3.

Even without reasoning, performance has improved steadily from the Qwen2.5 models, and it improves even further when reasoning is enabled.

Since you can switch between them just by adding /think or /no_think to the prompt, it's very useful in situations where you can only load a single model due to memory limits.
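A minimal sketch of that switching (the suffix convention is Qwen3's documented soft switch; the helper function is my own naming, not part of any API):

```python
def qwen3_prompt(user_text: str, think: bool) -> str:
    # Qwen3 soft switch: appending /think or /no_think to the user turn
    # toggles reasoning without reloading the model.
    suffix = "/think" if think else "/no_think"
    return f"{user_text} {suffix}"

# Same loaded model, two use cases:
fast = qwen3_prompt("Summarize this log file.", think=False)
careful = qwen3_prompt("Find the bug in this function.", think=True)
```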

Evaluating the largest model, Qwen3-235B-A22B, will likely take some time, so I'll add it here and post on X once the results are ready.

Thank you for reading to the end. I hope to see you in my next post.
