
Evaluating API Servers and Models for Running gpt-oss-20b on AMD Ryzen AI Max+ 395


Introduction

I purchased a mini PC (GMKtec EVO-X2) equipped with the AMD Ryzen AI Max+ 395, primarily to run generative AI locally.

With Ubuntu as the OS, most of the 128 GB of system memory can be used as VRAM. This large capacity makes it possible to handle relatively large LLMs and long context lengths that cannot fit on gaming GPUs. By setting up an API server, I plan to use the machine for private chats with no risk of information leakage and as a programming agent that can run without usage limits.
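How much system RAM the GPU can address is governed on Linux by the kernel's TTM limits rather than only the BIOS carve-out. The fragment below is one common approach, not the exact settings used in this article; it assumes the in-tree amdgpu/TTM driver, and the values count 4 KiB pages (27648000 pages ≈ 105 GiB), so pick a number that leaves enough RAM for the OS:

```shell
# /etc/default/grub — raise the GPU-addressable (GTT) memory limit.
# ASSUMPTION: in-tree amdgpu driver; ttm parameters count 4 KiB pages,
# so 27648000 pages ≈ 105 GiB of the 128 GB total.
GRUB_CMDLINE_LINUX_DEFAULT="ttm.pages_limit=27648000 ttm.page_pool_size=27648000"
# Apply with: sudo update-grub && sudo reboot
```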

When assuming such usage, I was curious whether llama.cpp or vLLM would be better as the API server software, and whether the accuracy of models converted to GGUF would be an issue. In this article, I will set up API servers using llama.cpp and vLLM, and evaluate the inference performance and accuracy when running gpt-oss-20b.

Environment and Models Used

  • OS : Ubuntu 24.04.3
  • Kernel : 6.17.0-1006-oem
  • ROCm : 7.1.1
  • Power Mode : Performance

In the following evaluation, I used ggml-org/gpt-oss-20b-GGUF as the model weights for llama.cpp and openai/gpt-oss-20b for vLLM. I chose gpt-oss-20b so I could iterate quickly through trial and error with a relatively lightweight model.

Inference Performance Evaluation

llama-bench

Inference performance was measured using llama-bench, which is included with llama.cpp. Tests were conducted with the Vulkan version (b7524) and the ROCm version (b1143).

Vulkan

| model                 |  params | backend | ngl | fa | test  |             t/s |
| --------------------- | ------: | ------- | --: | -: | ----- | --------------: |
| gpt-oss 20B MXFP4 MoE | 20.91 B | Vulkan  |  99 |  0 | pp128 |  749.13 ± 19.11 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | Vulkan  |  99 |  0 | pp512 |  1169.23 ± 8.67 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | Vulkan  |  99 |  0 | tg128 |    77.03 ± 0.06 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | Vulkan  |  99 |  0 | tg512 |    76.45 ± 0.15 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | Vulkan  |  99 |  1 | pp128 |  748.98 ± 25.73 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | Vulkan  |  99 |  1 | pp512 | 1298.61 ± 15.65 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | Vulkan  |  99 |  1 | tg128 |    77.79 ± 0.32 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | Vulkan  |  99 |  1 | tg512 |    77.42 ± 0.10 |

ROCm

| model                 |  params | backend | ngl | fa | test  |            t/s |
| --------------------- | ------: | ------- | --: | -: | ----- | -------------: |
| gpt-oss 20B MXFP4 MoE | 20.91 B | ROCm    |  99 |  0 | pp128 |  313.67 ± 7.25 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | ROCm    |  99 |  0 | pp512 | 709.07 ± 18.58 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | ROCm    |  99 |  0 | tg128 |   69.85 ± 0.26 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | ROCm    |  99 |  0 | tg512 |   68.31 ± 0.14 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | ROCm    |  99 |  1 | pp128 | 319.07 ± 16.07 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | ROCm    |  99 |  1 | pp512 | 769.28 ± 13.35 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | ROCm    |  99 |  1 | tg128 |   72.46 ± 0.18 |
| gpt-oss 20B MXFP4 MoE | 20.91 B | ROCm    |  99 |  1 | tg512 |   72.26 ± 0.02 |

The results indicated that the Vulkan version provided higher prompt processing performance. It is possible that ROCm optimization for RDNA 3.5 is still insufficient. In both cases, enabling Flash Attention was found to improve performance. For subsequent evaluations, the Vulkan version will be used with Flash Attention enabled.
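The size of the backend gap can be read directly off the fa = 1 rows of the two tables; the numbers below are copied from those rows, so this is just arithmetic:

```shell
# Vulkan-over-ROCm speedup, from the fa=1 pp512/tg512 rows of the tables above
awk 'BEGIN {
  printf "pp512: %.2fx\n", 1298.61 / 769.28   # prompt processing
  printf "tg512: %.2fx\n", 77.42 / 72.26      # token generation
}'
```

In other words, the backends are nearly tied on token generation (which is memory-bandwidth-bound), and the difference is almost entirely in prompt processing.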

llama-server

Performance when operating as an API server was measured with EvalScope's perf command. It was executed with the following command while varying the concurrency, using 128 input tokens and 512 output tokens (gpt-oss is a reasoning model, so output-heavy usage is expected to be common).

evalscope perf \
  --api openai \
  --url http://127.0.0.1:8000/v1/completions \
  --model ggml-org/gpt-oss-20b-GGUF \
  --tokenizer-path openai-mirror/gpt-oss-20b \
  --dataset random \
  --min-prompt-length 128 \
  --max-prompt-length 128 \
  --max-tokens 512 \
  --min-tokens 512 \
  --parallel 1 2 4 8 16 32 \
  --number 1 2 4 8 16 32 \
  --prefix-length 0 \
  --extra-args '{"ignore_eos": true}' \
  --read-timeout=6000

First, the llama.cpp API server was started with the following command to measure performance.

./llama-server \
    -m ggml-org/gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf \
    --port 8000 \
    -c $((128*1024)) \
    -fa on

I expected performance (RPS, requests per second) to improve as concurrency increased, since data read from memory could be reused across multiple requests; in practice this was not the case, and RPS was highest at a concurrency of 1.

┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│    1 │ 0.16 │    6.387 │    7.715 │   66.32 │    0.504 │   0.609 │    0.014 │   0.014 │    100.0%│
│    2 │ 0.08 │   23.948 │   24.331 │   42.76 │    0.598 │   0.730 │    0.046 │   0.047 │    100.0%│
│    4 │ 0.13 │   24.520 │   29.723 │   55.95 │    1.623 │   1.624 │    0.056 │   0.058 │    100.0%│
│    8 │ 0.13 │   38.942 │   59.276 │   56.02 │   15.231 │  34.051 │    0.058 │   0.064 │    100.0%│
│   16 │ 0.14 │   66.599 │  114.026 │   57.64 │   41.119 │  85.981 │    0.062 │   0.067 │    100.0%│
│   32 │ 0.14 │  129.315 │  222.546 │   59.53 │  103.334 │ 206.254 │    0.063 │   0.076 │    100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘

This is because llama-server handles only one request at a time by default. To process multiple requests in parallel, it seems necessary to add arguments as follows (referencing the official tutorial). Note that the context length available per request will be the total length divided by the specified concurrency (in the following case, 128K/32 = 4K).

./llama-server \
    -m ggml-org/gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf \
    --port 8000 \
    -c $((128*1024)) \
    -fa on \
    -np 32 \
    --threads-http 32
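The per-slot context split mentioned above is simple arithmetic — the server divides the total context evenly across the parallel slots:

```shell
# Each of the -np slots receives an equal share of the -c context
total_ctx=$((128 * 1024))   # -c 131072
slots=32                    # -np 32
echo $((total_ctx / slots)) # → 4096
```

If 4K tokens per request is too short for agent workloads, raise `-c` or lower `-np`.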

Although RPS did not increase much, Avg TTFT improved significantly because requests were now processed in parallel. For chat use, getting an early first response is likely more user-friendly even if the output itself is slower. It is unclear why Gen. toks/s does not increase at high concurrency.

┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│    1 │ 0.14 │    7.130 │    7.173 │   71.44 │    0.294 │   0.338 │    0.013 │   0.013 │    100.0%│
│    2 │ 0.12 │   15.892 │   22.998 │   48.51 │    0.429 │   0.537 │    0.038 │   0.044 │    100.0%│
│    4 │ 0.14 │   29.068 │   29.092 │   70.39 │    0.712 │   0.814 │    0.056 │   0.056 │    100.0%│
│    8 │ 0.19 │   42.536 │   42.579 │   96.20 │    1.192 │   1.330 │    0.081 │   0.081 │    100.0%│
│   16 │ 0.18 │   64.158 │   91.265 │   62.69 │    2.740 │   2.889 │    0.168 │   0.177 │    100.0%│
│   32 │ 0.22 │   96.277 │  146.477 │   77.44 │    5.166 │   5.789 │    0.226 │   0.277 │    100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘
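As a rough cross-check of these metrics, average latency should come out close to TTFT plus the output length times TPOT. Using the concurrency-1 row (values copied from the table, so rounded):

```shell
# Avg Lat ≈ Avg TTFT + output_tokens × Avg TPOT, from the conc=1 row above
awk 'BEGIN { printf "%.2f s\n", 0.294 + 512 * 0.013 }'   # → 6.95 s
```

This gives 6.95 s against the measured 7.130 s; the gap is mostly TPOT being rounded to three decimals in the table.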

vLLM

vLLM was installed as follows.

uv venv --python 3.12 --seed
source .venv/bin/activate

pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/rocm7.1
pip3 install ninja cmake wheel pybind11

git clone https://github.com/triton-lang/triton.git
pushd triton
git checkout 32d63ac
python3 setup.py install
popd

pip install packaging
git clone https://github.com/Dao-AILab/flash-attention.git
pushd flash-attention
git checkout 58fe37f
git submodule update --init
FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE python setup.py install
popd

pip install --upgrade numba scipy huggingface-hub[cli,hf_transfer] setuptools_scm
git clone https://github.com/vllm-project/vllm.git
pushd vllm
git checkout v0.13.0
pip install -r requirements/rocm.txt
PYTORCH_ROCM_ARCH=gfx1151 python3 setup.py develop
popd

pip install amdsmi==7.0.2

The vLLM API server was started with the following command.

FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \
    vllm serve openai/gpt-oss-20b --port 8000

In the case of vLLM, RPS improved as concurrency increased, and TTFT is also better than with llama-server. However, RPS and Gen. toks/s are very low at a concurrency below 8. I tried various things, such as changing backends and arguments, but saw no improvement. Since concurrency is unlikely to get that high in typical personal use, the conclusion seems to be that llama-server is the better choice.

┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│    1 │ 0.02 │   42.381 │   42.381 │   12.08 │    0.155 │   0.155 │    0.083 │   0.083 │    100.0%│
│    2 │ 0.03 │   59.305 │   59.305 │   17.27 │    0.225 │   0.225 │    0.116 │   0.116 │    100.0%│
│    4 │ 0.04 │   93.250 │   93.299 │   21.95 │    0.570 │   0.695 │    0.181 │   0.182 │    100.0%│
│    8 │ 0.17 │   46.247 │   46.279 │   88.50 │    0.919 │   1.046 │    0.089 │   0.090 │    100.0%│
│   16 │ 0.27 │   59.205 │   59.246 │  138.27 │    3.175 │   3.405 │    0.110 │   0.115 │    100.0%│
│   32 │ 0.49 │   65.774 │   65.817 │  248.93 │    3.396 │   3.539 │    0.122 │   0.128 │    100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘

Plotting the token generation performance of llama-server and vLLM together shows that at a concurrency of 8 or below, llama-server provides better performance, while vLLM pulls ahead at higher concurrency.

Accuracy Evaluation

Next, I will evaluate the accuracy of the LLM. Rather than obtaining absolute performance figures, the goal is to perform a relative evaluation of multiple models under consistent conditions. Here, I executed two types of benchmarks using EvalScope.

MMLU-Pro

MMLU-Pro is a test that evaluates reasoning ability and knowledge application. I conducted the evaluation with the following command. Since running all of the more than 5,000 tests would take a very long time, I evaluated only 10% (574 cases) of the total.

evalscope eval \
    --model ggml-org/gpt-oss-20b-GGUF \
    --eval-type openai_api \
    --api-url http://127.0.0.1:8000/v1 \
    --api-key EMPTY \
    --datasets mmlu_pro \
    --eval-batch-size 32 \
    --limit 0.1 \
    --timeout 6000 \
    --ignore-error

By adding the following options to the evalscope eval command, I was able to specify the model's Reasoning Effort as Low.

  • llama.cpp : --generation-config extra_body={"chat_template_kwargs":{"reasoning_effort":"low"}}
  • vLLM : --generation-config extra_body={"reasoning_effort":"low"}

The scores are summarized in the table below.

| Server    | Model                     | Effort | Score  |
| --------- | ------------------------- | ------ | ------ |
| vLLM      | openai/gpt-oss-20b        | Low    | 0.6638 |
| vLLM      | openai/gpt-oss-20b        | Medium | 0.7038 |
| llama.cpp | ggml-org/gpt-oss-20b-GGUF | Low    | 0.6673 |
| llama.cpp | ggml-org/gpt-oss-20b-GGUF | Medium | 0.7178 |

There was no decrease in accuracy with the GGUF model. Since both use the MXFP4 quantization method, the weights might be identical. In fact, GGUF appears to have slightly better accuracy, but I consider a difference of about 1% to be within the margin of error.
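A back-of-the-envelope check supports this: per the OCP Microscaling spec, MXFP4 stores blocks of 32 FP4 (E2M1) values with one shared 8-bit scale, i.e. 4.25 bits per weight, leaving little room for the two distributions to encode the MoE weights differently. A rough size estimate follows; it ignores the attention and embedding tensors kept at higher precision, so the real files are somewhat larger:

```shell
# MXFP4 block: 32 × 4-bit elements + one shared 8-bit scale per block
awk 'BEGIN {
  bpw = (32 * 4 + 8) / 32                              # bits per weight
  printf "%.2f bits/weight\n", bpw                     # → 4.25
  printf "~%.1f GB for 20.91B params\n", 20.91e9 * bpw / 8 / 1e9
}'
```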

The evaluation with Reasoning Effort set to High was omitted because it would take an extremely long time to execute.

MBPP

MBPP is a test that evaluates coding ability. I executed this one with --limit 0.2 (100 cases).

| Server    | Model                     | Effort | Score |
| --------- | ------------------------- | ------ | ----- |
| vLLM      | openai/gpt-oss-20b        | Low    | 0.8   |
| vLLM      | openai/gpt-oss-20b        | Medium | 0.9   |
| llama.cpp | ggml-org/gpt-oss-20b-GGUF | Low    | 0.77  |
| llama.cpp | ggml-org/gpt-oss-20b-GGUF | Medium | 0.91  |

Although the GGUF model's score decreased when Reasoning Effort was set to Low, it conversely increased at Medium, so it's unlikely that there is any degradation in accuracy.

I also frequently see the SWE-bench benchmark for evaluating coding ability, but when I tried it, the score was zero, so I was unable to evaluate it.

Summary

When running gpt-oss-20b on an AMD Ryzen AI Max+ 395 for personal use, it seems best to use llama.cpp, which offers good responsiveness at low concurrency. Based on my evaluation, the GGUF models distributed for llama.cpp appear to be equivalent in accuracy to the original models.

This article took a month to write, with much trial and error over how to evaluate LLM performance and accuracy and which server settings to use. New models keep appearing, but with an evaluation method established, it becomes possible to compare them and select the model that best fits your environment and use case.
