Translated by AI
Evaluating API Servers and Models for Running gpt-oss-20b on AMD Ryzen AI Max+ 395
Introduction
I purchased a mini PC (GMKtec EVO-X2) equipped with the AMD Ryzen AI Max+ 395, primarily for the purpose of running Generative AI locally.
By using Ubuntu as the OS, most of the 128GB system memory can be used as VRAM. Taking advantage of this large VRAM capacity makes it possible to handle relatively large LLM models and extensive context lengths that cannot be run on gaming GPUs. By setting up an API server, I am considering its use for private chats without information leakage and as a programming agent that can be used without limits.
When assuming such usage, I was curious whether llama.cpp or vLLM would be better as the API server software, and whether the accuracy of models converted to GGUF would be an issue. In this article, I will set up API servers using llama.cpp and vLLM, and evaluate the inference performance and accuracy when running gpt-oss-20b.
Environment and Models Used
- OS : Ubuntu 24.04.3
- Kernel : 6.17.0-1006-oem
- ROCm : 7.1.1
- Power Mode : Performance
In the following evaluation, I used ggml-org/gpt-oss-20b-GGUF as the model weights for llama.cpp and openai/gpt-oss-20b for vLLM. I chose gpt-oss-20b so that I could iterate quickly through trial and error with a relatively lightweight model.
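As a setup note, both sets of weights can be fetched ahead of time with the Hugging Face CLI. This is a sketch assuming `huggingface-cli` is installed and the repository layout at the time of writing; the GGUF file name matches the one used in the llama-server commands later.

```shell
# GGUF weights for llama.cpp
huggingface-cli download ggml-org/gpt-oss-20b-GGUF \
  gpt-oss-20b-mxfp4.gguf \
  --local-dir ggml-org/gpt-oss-20b-GGUF

# Original weights for vLLM (vLLM also downloads these automatically on first run)
huggingface-cli download openai/gpt-oss-20b
```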
Inference Performance Evaluation
llama-bench
Inference performance was measured using llama-bench, which is included with llama.cpp. Tests were conducted with the Vulkan version (b7524) and the ROCm version (b1143).
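The tables below can be reproduced with an invocation along these lines. This is a sketch: the model path is assumed, and flag spellings follow recent llama-bench builds.

```shell
# -ngl 99  : offload all layers to the GPU
# -fa 0,1  : run each test with Flash Attention disabled and enabled
# -p / -n  : prompt-processing (pp128, pp512) and token-generation (tg128, tg512) lengths
./llama-bench \
  -m ggml-org/gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf \
  -ngl 99 \
  -fa 0,1 \
  -p 128,512 \
  -n 128,512
```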
Vulkan
| model | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|
| gpt-oss 20B MXFP4 MoE | 20.91 B | Vulkan | 99 | 0 | pp128 | 749.13 ± 19.11 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | Vulkan | 99 | 0 | pp512 | 1169.23 ± 8.67 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | Vulkan | 99 | 0 | tg128 | 77.03 ± 0.06 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | Vulkan | 99 | 0 | tg512 | 76.45 ± 0.15 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | Vulkan | 99 | 1 | pp128 | 748.98 ± 25.73 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | Vulkan | 99 | 1 | pp512 | 1298.61 ± 15.65 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | Vulkan | 99 | 1 | tg128 | 77.79 ± 0.32 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | Vulkan | 99 | 1 | tg512 | 77.42 ± 0.10 |
ROCm
| model | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|
| gpt-oss 20B MXFP4 MoE | 20.91 B | ROCm | 99 | 0 | pp128 | 313.67 ± 7.25 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | ROCm | 99 | 0 | pp512 | 709.07 ± 18.58 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | ROCm | 99 | 0 | tg128 | 69.85 ± 0.26 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | ROCm | 99 | 0 | tg512 | 68.31 ± 0.14 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | ROCm | 99 | 1 | pp128 | 319.07 ± 16.07 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | ROCm | 99 | 1 | pp512 | 769.28 ± 13.35 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | ROCm | 99 | 1 | tg128 | 72.46 ± 0.18 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | ROCm | 99 | 1 | tg512 | 72.26 ± 0.02 |
The results indicated that the Vulkan version provides higher prompt processing performance; ROCm optimization for RDNA 3.5 may still be insufficient. In both cases, enabling Flash Attention improved performance. Subsequent evaluations therefore use the Vulkan version with Flash Attention enabled.
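For reference, the two backends are selected when building llama.cpp. The following is a sketch using the cmake option names of recent llama.cpp releases; gfx1151 is the RDNA 3.5 target used elsewhere in this article.

```shell
# Vulkan build
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j

# ROCm/HIP build targeting gfx1151 (RDNA 3.5)
cmake -B build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151
cmake --build build-rocm --config Release -j
```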
llama-server
Performance when operated as an API server was measured using EvalScope's perf command. It was executed with the following command while varying the concurrency, with 128 input tokens and 512 output tokens (since gpt-oss is a reasoning model, output-heavy usage is expected to be common).
evalscope perf \
--api openai \
--url http://127.0.0.1:8000/v1/completions \
--model ggml-org/gpt-oss-20b-GGUF \
--tokenizer-path openai-mirror/gpt-oss-20b \
--dataset random \
--min-prompt-length 128 \
--max-prompt-length 128 \
--max-tokens 512 \
--min-tokens 512 \
--parallel 1 2 4 8 16 32 \
--number 1 2 4 8 16 32 \
--prefix-length 0 \
--extra-args '{"ignore_eos": true}' \
--read-timeout=6000
First, the llama.cpp API server was started with the following command to measure performance.
./llama-server \
-m ggml-org/gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf \
--port 8000 \
-c $((128*1024)) \
-fa on
When increasing the concurrency, I expected throughput (RPS, requests per second) to improve because data read from memory could be reused across multiple requests; however, this was not the case, and RPS was highest at a concurrency of 1.
┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃ ┃ ┃ Avg ┃ P99 ┃ Gen. ┃ Avg ┃ P99 ┃ Avg ┃ P99 ┃ Success┃
┃Conc. ┃ RPS ┃ Lat.(s) ┃ Lat.(s) ┃ toks/s ┃ TTFT(s) ┃ TTFT(s) ┃ TPOT(s) ┃ TPOT(s) ┃ Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│ 1 │ 0.16 │ 6.387 │ 7.715 │ 66.32 │ 0.504 │ 0.609 │ 0.014 │ 0.014 │ 100.0%│
│ 2 │ 0.08 │ 23.948 │ 24.331 │ 42.76 │ 0.598 │ 0.730 │ 0.046 │ 0.047 │ 100.0%│
│ 4 │ 0.13 │ 24.520 │ 29.723 │ 55.95 │ 1.623 │ 1.624 │ 0.056 │ 0.058 │ 100.0%│
│ 8 │ 0.13 │ 38.942 │ 59.276 │ 56.02 │ 15.231 │ 34.051 │ 0.058 │ 0.064 │ 100.0%│
│ 16 │ 0.14 │ 66.599 │ 114.026 │ 57.64 │ 41.119 │ 85.981 │ 0.062 │ 0.067 │ 100.0%│
│ 32 │ 0.14 │ 129.315 │ 222.546 │ 59.53 │ 103.334 │ 206.254 │ 0.063 │ 0.076 │ 100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘
This is because llama-server handles only one request at a time by default. To process multiple requests in parallel, it seems necessary to add arguments as follows (referencing the official tutorial). Note that the context length available per request will be the total length divided by the specified concurrency (in the following case, 128K/32 = 4K).
./llama-server \
-m ggml-org/gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf \
--port 8000 \
-c $((128*1024)) \
-fa on \
-np 32 \
--threads-http 32
Although RPS did not increase much, Avg TTFT improved significantly because requests were processed in parallel. For chat use, an early first response is likely more user-friendly even if output is slower. It is unclear why Gen. toks/s does not increase at high concurrency.
┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃ ┃ ┃ Avg ┃ P99 ┃ Gen. ┃ Avg ┃ P99 ┃ Avg ┃ P99 ┃ Success┃
┃Conc. ┃ RPS ┃ Lat.(s) ┃ Lat.(s) ┃ toks/s ┃ TTFT(s) ┃ TTFT(s) ┃ TPOT(s) ┃ TPOT(s) ┃ Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│ 1 │ 0.14 │ 7.130 │ 7.173 │ 71.44 │ 0.294 │ 0.338 │ 0.013 │ 0.013 │ 100.0%│
│ 2 │ 0.12 │ 15.892 │ 22.998 │ 48.51 │ 0.429 │ 0.537 │ 0.038 │ 0.044 │ 100.0%│
│ 4 │ 0.14 │ 29.068 │ 29.092 │ 70.39 │ 0.712 │ 0.814 │ 0.056 │ 0.056 │ 100.0%│
│ 8 │ 0.19 │ 42.536 │ 42.579 │ 96.20 │ 1.192 │ 1.330 │ 0.081 │ 0.081 │ 100.0%│
│ 16 │ 0.18 │ 64.158 │ 91.265 │ 62.69 │ 2.740 │ 2.889 │ 0.168 │ 0.177 │ 100.0%│
│ 32 │ 0.22 │ 96.277 │ 146.477 │ 77.44 │ 5.166 │ 5.789 │ 0.226 │ 0.277 │ 100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘
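Outside of benchmarking, the server can be smoke-tested with a plain OpenAI-compatible request, for example:

```shell
# Quick check that the OpenAI-compatible endpoint responds
# (llama-server serves the loaded model regardless of the "model" field)
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 32
      }'
```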
vLLM
vLLM was installed as follows.
uv venv --python 3.12 --seed
source .venv/bin/activate
pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/rocm7.1
pip3 install ninja cmake wheel pybind11
git clone https://github.com/triton-lang/triton.git
pushd triton
git checkout 32d63ac
python3 setup.py install
popd
pip install packaging
git clone https://github.com/Dao-AILab/flash-attention.git
pushd flash-attention
git checkout 58fe37f
git submodule update --init
FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE python setup.py install
popd
pip install --upgrade numba scipy huggingface-hub[cli,hf_transfer] setuptools_scm
git clone https://github.com/vllm-project/vllm.git
pushd vllm
git checkout v0.13.0
pip install -r requirements/rocm.txt
PYTORCH_ROCM_ARCH=gfx1151 python3 setup.py develop
popd
pip install amdsmi==7.0.2
The vLLM API server was started with the following command.
FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \
vllm serve openai/gpt-oss-20b --port 8000
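For reference, context length and memory headroom can be adjusted with standard vllm serve flags; the values below are illustrative, not tuned for this machine.

```shell
# --max-model-len caps the context length (smaller KV cache);
# --gpu-memory-utilization sets the fraction of VRAM vLLM may reserve.
FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \
vllm serve openai/gpt-oss-20b \
  --port 8000 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```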
In the case of vLLM, RPS improved as the concurrency increased, and TTFT is also better than llama-server's. However, RPS and Gen. toks/s are very low at concurrencies below 8. I tried various things, such as changing backends and arguments, but saw no improvement. Since concurrency is unlikely to be that high in typical personal use, the conclusion seems to be that llama-server is the better choice.
┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃ ┃ ┃ Avg ┃ P99 ┃ Gen. ┃ Avg ┃ P99 ┃ Avg ┃ P99 ┃ Success┃
┃Conc. ┃ RPS ┃ Lat.(s) ┃ Lat.(s) ┃ toks/s ┃ TTFT(s) ┃ TTFT(s) ┃ TPOT(s) ┃ TPOT(s) ┃ Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│ 1 │ 0.02 │ 42.381 │ 42.381 │ 12.08 │ 0.155 │ 0.155 │ 0.083 │ 0.083 │ 100.0%│
│ 2 │ 0.03 │ 59.305 │ 59.305 │ 17.27 │ 0.225 │ 0.225 │ 0.116 │ 0.116 │ 100.0%│
│ 4 │ 0.04 │ 93.250 │ 93.299 │ 21.95 │ 0.570 │ 0.695 │ 0.181 │ 0.182 │ 100.0%│
│ 8 │ 0.17 │ 46.247 │ 46.279 │ 88.50 │ 0.919 │ 1.046 │ 0.089 │ 0.090 │ 100.0%│
│ 16 │ 0.27 │ 59.205 │ 59.246 │ 138.27 │ 3.175 │ 3.405 │ 0.110 │ 0.115 │ 100.0%│
│ 32 │ 0.49 │ 65.774 │ 65.817 │ 248.93 │ 3.396 │ 3.539 │ 0.122 │ 0.128 │ 100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘
Plotting the token generation performance when inferring with llama-server and vLLM on a graph results in the following. When the concurrency is 8 or less, using llama-server provides better performance.

Accuracy Evaluation
Next, I will evaluate the accuracy of the LLM. Rather than obtaining absolute performance figures, the goal is to perform a relative evaluation of multiple models under consistent conditions. Here, I executed two types of benchmarks using EvalScope.
MMLU-Pro
MMLU-Pro is a test that evaluates reasoning ability and knowledge application. I conducted the evaluation with the following command. Since running all of the more than 5,000 tests would take a very long time, I evaluated only 10% (574 cases) of the total.
evalscope eval \
--model ggml-org/gpt-oss-20b-GGUF \
--eval-type openai_api \
--api-url http://127.0.0.1:8000/v1 \
--api-key EMPTY \
--datasets mmlu_pro \
--eval-batch-size 32 \
--limit 0.1 \
--timeout 6000 \
--ignore-error
By adding the following options to the evalscope eval command, I was able to specify the model's Reasoning Effort as Low.
- llama.cpp : --generation-config extra_body={"chat_template_kwargs":{"reasoning_effort":"low"}}
- vLLM : --generation-config extra_body={"reasoning_effort":"low"}
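Putting it together, a Low-effort MMLU-Pro run against llama-server looks like this (a sketch combining the evalscope eval command and the llama.cpp option above):

```shell
evalscope eval \
  --model ggml-org/gpt-oss-20b-GGUF \
  --eval-type openai_api \
  --api-url http://127.0.0.1:8000/v1 \
  --api-key EMPTY \
  --datasets mmlu_pro \
  --eval-batch-size 32 \
  --limit 0.1 \
  --timeout 6000 \
  --ignore-error \
  --generation-config 'extra_body={"chat_template_kwargs":{"reasoning_effort":"low"}}'
```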
The scores are summarized in the table below.
| Server | Model | Effort | Score |
|---|---|---|---|
| vLLM | openai/gpt-oss-20b | Low | 0.6638 |
| vLLM | openai/gpt-oss-20b | Medium | 0.7038 |
| llama.cpp | ggml-org/gpt-oss-20b-GGUF | Low | 0.6673 |
| llama.cpp | ggml-org/gpt-oss-20b-GGUF | Medium | 0.7178 |
There was no decrease in accuracy with the GGUF model. Since both use the MXFP4 quantization method, the weights might be identical. In fact, GGUF appears to have slightly better accuracy, but I consider a difference of about 1% to be within the margin of error.
The evaluation with Reasoning Effort set to High was omitted because it would take an extremely long time to execute.
MBPP
MBPP is a test that evaluates coding ability. I executed this one with --limit 0.2 (100 cases).
| Server | Model | Effort | Score |
|---|---|---|---|
| vLLM | openai/gpt-oss-20b | Low | 0.8 |
| vLLM | openai/gpt-oss-20b | Medium | 0.9 |
| llama.cpp | ggml-org/gpt-oss-20b-GGUF | Low | 0.77 |
| llama.cpp | ggml-org/gpt-oss-20b-GGUF | Medium | 0.91 |
Although the GGUF model's score decreased when Reasoning Effort was set to Low, it conversely increased at Medium, so it's unlikely that there is any degradation in accuracy.
I also frequently see the SWE-bench benchmark for evaluating coding ability, but when I tried it, the score was zero, so I was unable to evaluate it.
Summary
When running gpt-oss-20b on an AMD Ryzen AI Max+ 395 for personal use, it seems best to use llama.cpp, which offers good responsiveness at low concurrency. Based on my evaluation, the GGUF models distributed for llama.cpp appear to be equivalent in accuracy to the original models.
It took me a month to write this article, through trial and error regarding how to evaluate LLM performance and accuracy, and which server settings to use. While new models continue to emerge, establishing an evaluation method makes it possible to compare them and select the model that best fits your environment and use case.