Qwen3 Inference Speed and Shaberi3 Benchmark Results
2025-05-03 Update
I have conducted the Shaberi3 benchmark for Qwen3-235B-A22B, so I'm adding the results here.
Inference Speed
I have also added the inference speed [tokens/s] for Qwen3-235B-A22B to the Speed section.
| Model | prompt eval rate | eval rate |
|---|---|---|
| Qwen3-235B-A22B:UD-Q3_K_XL_think | 56.08 | 16.60 |
The eval rate is reasonably fast, but the slow prompt eval rate is a concern.
For example, when using it as a coding agent, inputs of several thousand tokens are common, so it's inevitable that there will be a significant wait from the time the prompt is entered until generation starts.
That said, it seems useful for the following purposes:
- Single-turn processing
- Coding with auto-approval
- Automated batch processing
Shaberi3 Benchmark Results
Notes
At the time of this update, gemini-2.0-flash-exp, which had originally been available 1,500 times per day, had been limited to 500 times per day.
With only 500 calls per day available, I judged that evaluating with the higher-performance gemini-2.5-flash-preview-04-17 would give a more accurate comparison, so I used gemini-2.5-flash-preview-04-17 as the evaluator instead.
Here are the results.

| Model | Weighted Mean | ELYZA-tasks-100 | Japanese MT-Bench | Tengu-Bench |
|---|---|---|---|---|
| gemini-2.5-flash-preview-04-17 | 9.2 | 9.3 | 9.7 | 8.8 |
| grok-3-mini-beta | 9.0 | 9.3 | 9.7 | 8.5 |
| DeepSeek-R1 | 8.9 | 9.0 | 9.5 | 8.7 |
| gpt-4.1-2025-04-14 | 8.9 | 9.1 | 9.5 | 8.4 |
| ★Qwen3-235B-A22B:UD-Q3_K_XL-think | 8.8 | 8.9 | 9.6 | 8.5 |
| ★qwen3-235b-a22b:free | 8.8 | 9.0 | 9.1 | 8.5 |
| claude-3-7-sonnet-20250219 | 8.7 | 9.1 | 9.5 | 7.9 |
| DeepSeek-V3-0324 | 8.6 | 8.9 | 9.0 | 8.1 |
| ★Qwen3-235B-A22B:UD-Q3_K_XL_no_think | 8.5 | 8.7 | 9.4 | 7.8 |
| ★Qwen3-32B:UD-Q4_K_XL_think | 8.4 | 8.6 | 9.5 | 7.6 |
| DeepSeek-R1-UD-IQ1_S | 8.2 | 8.5 | 8.7 | 7.8 |
| ★Qwen3-30B-A3B:UD-Q4_K_XL-think | 8.0 | 7.6 | 9.3 | 7.6 |
Notable Mentions
Qwen3-235B-A22B
To confirm the impact of quantization, I evaluated both qwen3-235b-a22b:free on OpenRouter and Qwen3-235B-A22B:UD-Q3_K_XL.
As a result, surprisingly, Qwen3-235B-A22B:UD-Q3_K_XL achieved a higher score!
From this fact, the following possibilities can be considered:
- Unsloth's `UD-Q3_K_XL` is quantized with almost no degradation.
- OpenRouter's `qwen3-235b-a22b:free` applies some form of quantization (e.g., `Q4_K_M`).
In any case, it is noteworthy that Qwen3-235B-A22B:UD-Q3_K_XL-think surpassed DeepSeek-V3-0324 and achieved a score comparable to DeepSeek-R1.
Comparison between Qwen3-235B-A22B:UD-Q3_K_XL and DeepSeek-R1-UD-IQ1_S
DeepSeek-R1-UD-IQ1_S, even quantized to 1.58 bit, could not run in under 128GB of VRAM, whereas Qwen3-235B-A22B:UD-Q3_K_XL-think runs in about 120GB of VRAM.
On top of that, it scored higher than the 1.58-bit quantized DeepSeek-R1, which demonstrates the high performance of Qwen3-235B-A22B:UD-Q3_K_XL.
Impact of reasoning
Even for Qwen3-235B-A22B:UD-Q3_K_XL, the score with reasoning is clearly higher than without.
That said, even without reasoning, the score is higher than Qwen3-32B:UD-Q4_K_XL_think.
Qwen3 gives the impression of "thinking" for quite a long time, so if you have sufficient VRAM, Qwen3-235B-A22B:UD-Q3_K_XL_no_think might be a better choice in terms of both performance and speed.
Introduction
Who is this article for?
- Those interested in the Japanese language performance of generative AI
- Those interested in Qwen3
Test Environment
- Mac Studio (M2 Ultra 128GB)
Overview
I evaluate LLM using the Shaberi3 benchmark as a hobby.
I usually post my results on X, but since Qwen3 has so much to cover, I decided to summarize it in a short article here.
Evaluated Models and Conditions
With no particular selection criteria, I evaluated the following GGUF models published by Unsloth, running them with Ollama.
Parameter Sizes and Quantization Types Evaluated This Time
- 30B-A3B:UD-Q4_K_XL
- 32B:UD-Q4_K_XL
- 14B:UD-Q4_K_XL
- 8B:UD-Q4_K_XL
- 4B:UD-Q4_K_XL
- 1.7B:UD-Q4_K_XL
- 0.6B:UD-Q4_K_XL
Other Special Notes
- Temperature: 0.6
- Reasoning
  - Evaluated both with reasoning (`/think`) and without (`/no_think`).
  - In both cases, `<think>...</think>` tags were removed.
- Evaluator: `gemini-2.0-flash-exp`
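As a hedged illustration (not the exact script used for this benchmark), stripping the reasoning block before scoring can be done with a small regex, assuming the model emits a `<think>...</think>` span ahead of its final answer:

```python
import re

# Matches a <think>...</think> block, including an unclosed one that
# runs to the end of the string (possible when generation is cut off).
THINK_RE = re.compile(r"<think>.*?(?:</think>|\Z)", re.DOTALL)

def strip_think(text: str) -> str:
    """Remove the reasoning span so the judge only sees the final answer."""
    return THINK_RE.sub("", text).strip()
```

The names here (`THINK_RE`, `strip_think`) are my own; the point is only that the judge should score identical content whether or not reasoning was enabled.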
Results
Speed
For reference, these are single-run measurements taken with Ollama for a 389-token input. Units are tokens/s.
| Model | prompt eval rate | eval rate |
|---|---|---|
| Qwen3-235B-A22B:UD-Q3_K_XL_think | 56.08 | 16.60 |
| Qwen3-30B-A3B:UD-Q4_K_XL_think | 422.21 | 55.66 |
| Qwen3-32B:UD-Q4_K_XL_think | 188.86 | 20.07 |
| Qwen3-14B:UD-Q4_K_XL_think | 426.52 | 40.91 |
| Qwen3-8B:UD-Q4_K_XL_think | 748.22 | 58.30 |
| Qwen3-4B:UD-Q4_K_XL_think | 1177.05 | 79.79 |
| Qwen3-1.7B:UD-Q4_K_XL_think | 2401.57 | 131.22 |
| Qwen3-0.6B:UD-Q4_K_XL_think | 4267.43 | 171.23 |
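Rates like these can also be derived from Ollama's HTTP API: the final `/api/generate` response reports token counts and durations in nanoseconds. A minimal sketch, where the sample numbers are hypothetical values chosen only to roughly reproduce the 235B row above:

```python
def rates(resp: dict) -> tuple[float, float]:
    """Compute (prompt eval rate, eval rate) in tokens/s from the timing
    fields of an Ollama /api/generate final response (durations are ns)."""
    prompt_rate = resp["prompt_eval_count"] / resp["prompt_eval_duration"] * 1e9
    eval_rate = resp["eval_count"] / resp["eval_duration"] * 1e9
    return prompt_rate, eval_rate

# Hypothetical final response (only the timing fields shown):
sample = {
    "prompt_eval_count": 389,
    "prompt_eval_duration": 6_937_000_000,   # ~6.94 s
    "eval_count": 512,
    "eval_duration": 30_843_000_000,         # ~30.8 s
}
p, e = rates(sample)
print(f"prompt eval rate: {p:.2f} tokens/s, eval rate: {e:.2f} tokens/s")
```

The same figures are printed directly by `ollama run --verbose`, which is the simpler route for one-off checks.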
Shaberi3 Benchmark Results
The results are shown below. First, the results for the Qwen3 models alone are shown.

| Model | Weighted Mean | ELYZA-tasks-100 | Japanese MT-Bench | Tengu-Bench |
|---|---|---|---|---|
| 32B:UD-Q4_K_XL_think | 8.7 | 9.0 | 9.4 | 8.2 |
| 14B:UD-Q4_K_XL_think | 8.5 | 8.8 | 9.2 | 7.9 |
| 30B-A3B:UD-Q4_K_XL_think | 8.5 | 8.7 | 9.5 | 7.8 |
| 32B:UD-Q4_K_XL_no_think | 8.4 | 8.6 | 9.5 | 7.8 |
| 8B:UD-Q4_K_XL_think | 8.3 | 8.6 | 9.3 | 7.6 |
| 14B:UD-Q4_K_XL_no_think | 8.2 | 8.3 | 9.3 | 7.6 |
| 30B-A3B:UD-Q4_K_XL_no_think | 8.1 | 8.1 | 9.0 | 7.7 |
| 4B:UD-Q4_K_XL_think | 8.0 | 8.2 | 9.0 | 7.3 |
| 8B:UD-Q4_K_XL_no_think | 8.0 | 8.1 | 9.1 | 7.3 |
| 4B:UD-Q4_K_XL_no_think | 7.2 | 7.3 | 8.2 | 6.7 |
| 1.7B:UD-Q4_K_XL_think | 7.0 | 6.9 | 8.2 | 6.4 |
| 1.7B:UD-Q4_K_XL_no_think | 5.9 | 5.8 | 7.0 | 5.5 |
| 0.6B:UD-Q4_K_XL_think | 4.9 | 5.2 | 5.4 | 4.4 |
| 0.6B:UD-Q4_K_XL_no_think | 4.5 | 4.7 | 4.8 | 4.2 |
Next, the results compared with other models are shown.

| Model | Weighted Mean | ELYZA-tasks-100 | Japanese MT-Bench | Tengu-Bench |
|---|---|---|---|---|
| DeepSeek-V3-0324 | 8.8 | 9.4 | 9.1 | 8.2 |
| DeepSeek-R1-UD-IQ1_S | 8.7 | 8.9 | 9.4 | 8.3 |
| ★Qwen3-32B:UD-Q4_K_XL_think | 8.7 | 9.0 | 9.4 | 8.2 |
| gpt-4.1-mini-2025-04-14 | 8.7 | 9.1 | 9.3 | 8.1 |
| gemini-2.0-flash-001 | 8.6 | 9.0 | 9.4 | 7.8 |
| ★Qwen3-14B:UD-Q4_K_XL_think | 8.5 | 8.8 | 9.2 | 7.9 |
| ★Qwen3-30B-A3B:UD-Q4_K_XL_think | 8.5 | 8.7 | 9.5 | 7.8 |
| ★Qwen3-32B:UD-Q4_K_XL_no_think | 8.4 | 8.6 | 9.5 | 7.8 |
| gemma-3-27b-it-Q8_0.gguf | 8.4 | 8.9 | 9.2 | 7.6 |
| ★Qwen3-8B:UD-Q4_K_XL_think | 8.3 | 8.6 | 9.3 | 7.6 |
| gpt-4o-mini-2024-07-18 | 8.3 | 8.6 | 9.2 | 7.6 |
| phi4:14b-q4_K_M | 8.3 | 8.5 | 9.0 | 7.7 |
Notable Points
Scores are Higher with Reasoning
One characteristic of the evaluator, gemini-2.0-flash-exp, is that scores tend to be higher when there are more output tokens.
Therefore, I try to make the evaluation as fair as possible by removing the <think>...</think> tags. However, even under these fair conditions, it is clear that the scores with reasoning are significantly higher.
In particular, for Qwen3 models with 1.7B parameters or more, the performance boost from reasoning is quite impressive.
Qwen3-32B:UD-Q4_K_XL_think
Looking at Qwen3-32B:UD-Q4_K_XL_think, it demonstrates performance on par with DeepSeek-V3-0324. However, since DeepSeek-V3-0324 is a model without reasoning, this might not be an entirely fair comparison.
Among models with reasoning, it is comparable to DeepSeek-R1-UD-IQ1_S (DeepSeek-R1 quantized to 1.58-bit), which highlights the high performance of Qwen3-32B.
Qwen3-14B:UD-Q4_K_XL_think and Qwen3-30B-A3B:UD-Q4_K_XL_think
Both of these models outperformed gemma-3-27b and showed slightly lower performance than gemini-2.0-flash-001.
| Model | prompt eval rate | eval rate |
|---|---|---|
| Qwen3-14B:UD-Q4_K_XL_think | 426.52 | 40.91 |
| Qwen3-30B-A3B:UD-Q4_K_XL_think | 422.21 | 55.66 |
In terms of speed, Qwen3-30B-A3B appears to be the better choice.
Qwen3-8B:UD-Q4_K_XL_think
Qwen3-8B:UD-Q4_K_XL_think achieved scores similar to gpt-4o-mini-2024-07-18 and phi4:14b-q4_K_M. Although not a perfectly fair comparison due to the use of reasoning, it scores significantly higher than gemma-2-9B.
Qwen3-4B:UD-Q4_K_XL_think
The scores are comparable to gemma-2-9b and qwen2.5:14b.
Qwen3-1.7B:UD-Q4_K_XL_think and Qwen3-0.6B:UD-Q4_K_XL_think
While they might be useful depending on the application, they don't seem to offer significantly higher performance compared to other models.
General Observations on Models Without Reasoning
Performance has improved steadily compared to same-sized models in the Qwen2.5 series.
On the other hand, the scores are clearly lower compared to when reasoning is enabled.
Since response speed is faster without reasoning, it might be useful when you want to utilize the same model for two different use cases.
Summary
In this article, I summarized the speed and Shaberi3 benchmark results for Qwen3.
Even without reasoning, performance has improved steadily from the Qwen2.5 models, and it improves even further when reasoning is enabled.
Since you can switch between them just by adding `/think` or `/no_think` to the prompt, it's very useful in situations where you can only load a single model due to memory limits.
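As a sketch of how that switch looks in practice, assuming Ollama's `/api/chat` endpoint and Qwen3's soft-switch convention of appending the flag to the user message (the helper names here are my own):

```python
import json
import urllib.request

def build_messages(user_prompt: str, think: bool) -> list[dict]:
    """Append Qwen3's soft switch to the user turn to toggle reasoning."""
    flag = "/think" if think else "/no_think"
    return [{"role": "user", "content": f"{user_prompt} {flag}"}]

def ask(model: str, prompt: str, think: bool,
        host: str = "http://localhost:11434") -> str:
    """Send one non-streaming chat request to a local Ollama server."""
    body = json.dumps({
        "model": model,
        "messages": build_messages(prompt, think),
        "stream": False,
        "options": {"temperature": 0.6},  # matches the setting used above
    }).encode()
    req = urllib.request.Request(f"{host}/api/chat", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read())["message"]["content"]
```

The same loaded model serves both modes, which is what makes the single-model, two-use-case setup described above workable.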
Evaluating the largest model, Qwen3-235B-A22B, will likely take some time, so I'll add it here and post on X once the results are ready.
Thank you for reading to the end. I hope to see you in my next post.