iTranslated by AI
Verifying Qwen3.6-35B-A3B on RTX 5090: The 18 t/s Trap and the Real Difference from Qwen3.5
This verification was conducted as of April 2026. Figures may change due to updates in llama.cpp or the models themselves.
Introduction
Qwen3.6-35B-A3B was released on April 15, 2026. The previous generation, Qwen3.5-35B-A3B, achieved a sufficient speed of 214 t/s on an RTX 5090 + llama.cpp, which I have been using in my actual AITuber backend.
This time, I conducted a hardware-level verification to see if I could "switch to Qwen3.6." As a result, the true cause of the initial 18 t/s reading was found in an unexpected place, so I have documented the entire process.
Since this article is written in the same environment and from the same perspective as my previous article (RTX 5090 + Qwen3.5 MXFP4/Q4_K_M comparison), I recommend reading them together for easier comparison.
Verification Environment
| Item | Details |
|---|---|
| GPU | NVIDIA GeForce RTX 5090 (32GB VRAM, Blackwell SM 12.0) |
| llama.cpp | b8870 (82209efb7, 2026-04-21 build) |
| Build | cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 |
| OS | WSL2 Ubuntu 24.04 |
| Model Storage | WSL2 ext4 → High-speed mmap |
Target Models and Quantization
For Qwen3.6-35B-A3B, I used the UD (Unsloth Dynamic) GGUF from unsloth.
| Quantization | File Size | Estimated VRAM |
|---|---|---|
| UD-Q4_K_M | 20.6 GB | ~23 GB |
| UD-Q5_K_S | 23.2 GB | ~25 GB |
UD (Unsloth Dynamic) is a mixed quantization method that uses higher precision for important layers and lower precision for less important ones. Its main feature is that it maintains quality even with the same file size as uniform quantization.
As a comparison, I also measured Qwen3.5-35B-A3B Q4_K_M (20.6 GB) at the same time.
Initial Measurement — The Shock of 18 t/s
After downloading the model and running llama-bench, I got these results:
TG 128: 18 t/s
This is about 1/12th of the speed in the environment where Qwen3.5 achieved 214 t/s. I suspected that llama.cpp might not support Qwen3.6, but after investigating, I found a different cause.
Investigation of the Root Cause
I started by comparing it with Qwen3.5.
# Qwen3.5 Q4_K_M
llama-bench ... → TG 214 t/s ✅
# Qwen3.6 Q5_K_S (under same conditions)
llama-bench ... → TG 18 t/s ❌
If it were a compatibility issue with llama.cpp, Qwen3.5 should also have been affected. Since Qwen3.5 was normal, I concluded it was a model-specific issue.
Next, I checked the VRAM usage.
nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader,nounits
# → 30067, 2533
VRAM usage was over 30 GB.
The cause was Ollama. Ollama, running on the Windows side, was idling while loading qwen3.6:latest (Q4_K_M, 22.7 GB).
Ollama qwen3.6:latest (Q4_K_M): 22.7 GB
Gemma-4-E4B (Hermes backend): 7 GB
Total: approx. 30 GB → Remaining 2.5 GB
With only 2.5 GB remaining, the Qwen3.6 Q5_K_S (23.2 GB) model for llama.cpp could not fit in the GPU and fell back to the CPU. The 18 t/s was the result of CPU inference.
Verifying the llama.cpp Build
I had been using a version of llama.cpp (b8738, 2026-04-09) built before the Qwen3.6 release (2026-04-15). Since it might have lacked new features, I performed a git pull and rebuilt it.
cd ~/projects/llama.cpp
git pull
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build --config Release -j$(nproc) -t llama-server llama-bench
# b8738 → b8870 (132 commits updated)
Re-measurement with Clean VRAM + New Binary
I unloaded Ollama (POST with keep_alive: 0) and re-measured after freeing up VRAM to 7 GB.
llama-bench Results (pp512 / tg128 / r=3)
| Condition | PP 512 (t/s) | TG 128 (t/s) |
|---|---|---|
| Qwen3.5 Q4_K_M — v8738 *Reference | 4151 | 214 |
| Qwen3.6 Q5_K_S — v8738 (clean VRAM) | 2382 | 153 |
| Qwen3.6 Q5_K_S — v8870 (clean VRAM) | 3293 | 180 |
| Qwen3.6 Q4_K_M — v8870 (clean VRAM) | 2892 | 183 |
| Qwen3.6 Q5_K_S — VRAM overflow (CPU fallback) | — | 18 ⚠️ |
The update to llama.cpp (b8738→b8870) improved the TG from 153 to 180 t/s (+18%). It seems optimizations specific to Qwen3.6 were included.
*Qwen3.5 was not re-measured with v8870, so it is a reference value. It may be even faster with v8870.
Observation: Almost No Speed Difference Between Q4_K_M and Q5_K_S
The effectiveness of UD quantization is clearly visible.
| Quantization | Size | TG (t/s) | VRAM Diff |
|---|---|---|---|
| Q5_K_S | 23.2 GB | 180 | +2.6 GB |
| Q4_K_M | 20.6 GB | 183 | Baseline |
With uniform quantization, Q5 is usually faster than Q4, but because UD keeps important layers at high precision, Q4_K_M has minimal quality degradation while maintaining equivalent speed. If you want to save VRAM, Q4_K_M is the way to go.
Overall Comparison: Qwen3.5 vs Qwen3.6
| Item | Qwen3.5 Q4_K_M | Qwen3.6 Q4_K_M |
|---|---|---|
| File Size | 20.6 GB | 20.6 GB |
| Estimated VRAM | ~22 GB | ~23 GB |
| PP 512 (t/s) | 4151 | 2892 |
| TG 128 (t/s) | 214 | 183 |
| TG Speed Ratio | Baseline | -15% |
At the same size, Qwen3.5 is about 15% faster.
Looking purely at speed, Qwen3.5 is advantageous. The justification for switching to Qwen3.6 depends on the quality difference.
Quality Evaluation (5 Practical Prompts)
I compared both models by feeding the same prompts via llama-server. Qwen3 has a "Thinking" mode, but I disabled it using --reasoning-budget 0 to match my actual AITuber operation (enabling it generates excessive thinking tokens within <think> tags, causing a noticeable spike in latency). max_tokens=300 is used for both.
Results Summary
| Prompt | Qwen3.5 | Qwen3.6 | Impression |
|---|---|---|---|
| AITuber Character Response | Short & Natural | Ideas based on character name, rich in emojis | 3.6 is fun |
| Japanese Reasoning | Correct/Concise | Correct/Organized in bullet points | Tie |
| Tech Explanation (30s) | Excellent Metaphor | ⚠️ Mixed with "电脑" (Chinese) | 3.5 is stable |
| AI Chatting | Short/Interactive | Longer/Information-dense | Depends on use case |
| Creative Reasoning | Systematic | Same level | Tie |
Here is an excerpt of the actual responses for the two prompts that showed a difference.
AITuber Character Response ("Is Hakushi-chan really smart?")
Qwen3.5:
Hehe, are you saying that? Are you underestimating my, Hakushi-chan's, brain?! Of course I'm smart?! Because I'm a "genius" who can understand anything in an instant!♪
Qwen3.6:
Because my brain is a "blank sheet" (Hakushi), I am high-performance hardware that can instantly clear and rewrite any problem!🧠💡
Qwen3.6 picked up the meaning of the character name "Hakushi" to create its response. As an expression of character personality, 3.6 was a step ahead.
Tech Explanation (Quantum computers for middle school students)
Qwen3.5:
A regular computer uses "1 or 0" switches for calculations, but a quantum computer can handle "1 and 0 at the same time," making it tens of thousands of times faster for complex calculations. It's like going through an entire maze simultaneously.
Qwen3.6:
A regular computer processes information with the "ON" and "OFF" of a lightbulb. On the other hand, quantum computers use a special state that can be "both ON and OFF." Because of this, it can solve complex problems in parallel at once, finishing calculations that would take centuries for conventional 电脑 (computers) in an instant.
Qwen3.5's metaphor of "going through the entire maze simultaneously" was excellent for children. Qwen3.6 had "电脑" (computer in Chinese) mixed in, which becomes a problem in actual operation.
Summary
Insights gained from this verification:
-
The cause of 18 t/s was not llama.cpp, but VRAM overflow.
Starting llama.cpp while an Ollama idle model occupies VRAM causes a CPU fallback. Be careful when using multiple tools simultaneously. -
The build date of llama.cpp is crucial.
Binaries built before the model release do not include model-specific optimizations. It is safer to restart fromgit pullwhen trying out a new model. -
Q4_K_M is sufficient for UD quantization.
There is almost no speed difference between Q5_K_S and Q4_K_M (183 vs 180 t/s). Since you can save 2.5 GB of VRAM, you should choose Q4_K_M if you want to run it alongside ComfyUI or TTS. -
Qwen3.6 is 15% slower than Qwen3.5 (at same size).
If you prioritize speed, Qwen3.5 Q4_K_M is still advantageous. Your decision depends on whether you value the quality difference.
Discussion
5090流石爆速ですね
ありがとうございます!ただ、VRAMを他に食われると一気に18 t/sまで落ちるので、爆速を活かすにはVRAM管理が肝でした笑