How to Access vLLM OpenAI-Compatible API with Local Safetensors via Python OpenAI Library
As the title says. This also doubles as a personal memo.
For the details, see the sections below.
Local safetensors files
In this case, after training with unsloth, the model was saved to the directory math_100k/model with the following command, and that directory is what we pass to vLLM.
model.save_pretrained_merged("math_100k/model", tokenizer, save_method="merged_16bit")
The following files were saved:
your_model/
├── chat_template.jinja
├── config.json
├── generation_config.json
├── model-00001-of-0000X.safetensors
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json
Needs verification: some of these files may not actually be required for vLLM to load the model.
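For completeness, a rough sketch of the unsloth side that produces this merged checkpoint; the base model name and LoRA settings below are placeholders, and the training loop itself is elided:

from unsloth import FastLanguageModel

# Load a base model (placeholder name) with unsloth.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="your_base_model",  # hypothetical; whatever you fine-tuned from
    max_seq_length=8192,
    load_in_4bit=True,
)

# Attach LoRA adapters; the actual fine-tuning (e.g. with TRL's SFTTrainer) is omitted here.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Merge the adapters into the base weights and write 16-bit safetensors,
# producing the math_100k/model directory shown above.
model.save_pretrained_merged("math_100k/model", tokenizer, save_method="merged_16bit")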
Running the vLLM server
Specify the directory for the model and tokenizer, and set the IP and port according to your use case.
python -m vllm.entrypoints.openai.api_server \
--model math_100k/model \
--tokenizer math_100k/model \
--host 0.0.0.0 --port 8000 \
--chat-template math_100k/model/chat_template.jinja
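Before wiring up the client, you can sanity-check that the server is reachable; the /health and /v1/models routes (visible in the startup log further down) are handy for this. A quick sketch using the requests library, assuming the host and port above:

import requests

# /health returns 200 once the engine is ready to serve requests.
print(requests.get("http://localhost:8000/health").status_code)

# /v1/models lists the served model ids; it should include 'math_100k/model'.
data = requests.get("http://localhost:8000/v1/models").json()
print([m["id"] for m in data["data"]])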
How to call it with the Python OpenAI library
Specifying the URL and other settings
Even though no API key is configured on the server, you still need to pass some non-empty string as api_key.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)
Passing an empty string resulted in an APIConnectionError: Connection error.
Retrieving the model list
for i_model in client.models.list():
    print(i_model)
The response is as follows:
Model(id='math_100k/model', created=1752901244, object='model', owned_by='vllm', root='math_100k/model', parent=None, max_model_len=8192, permission=[{'id': 'modelperm-ca09efcda32a4e05a4b16e53ff723041', 'object': 'model_permission', 'created': 1752901244, 'allow_create_engine': False, 'allow_sampling': True, 'allow_logprobs': True, 'allow_search_indices': False, 'allow_view': True, 'allow_fine_tuning': False, 'organization': '*', 'group': None, 'is_blocking': False}])
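Since the served model id is simply the directory passed to --model, you can also pick it up programmatically instead of hard-coding it (a small convenience, not required):

# The list has a single entry here, so grab its id directly.
model_id = client.models.list().data[0].id
print(model_id)  # -> math_100k/model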
Inference in chat format
As shown in the model list above, specify the model directory math_100k/model for the model parameter, just as when starting vLLM.
completion = client.chat.completions.create(
    model="math_100k/model",
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": "こんにちは！"},
        ]},
    ],
)
Response
Retrieve it with:
print(completion.choices[0].message.content)
こんにちは！今日はどうしましたか？
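If you want tokens as they are generated, the same endpoint also supports streaming through the standard OpenAI client; a minimal sketch against the server above:

stream = client.chat.completions.create(
    model="math_100k/model",
    messages=[{"role": "user", "content": "こんにちは！"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental piece of the reply (may be None).
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()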
That's all.
Log output when loading on the vLLM side
Just in case, here is what the console shows when the model loads successfully.
(unslothbw) kurogane@kurogane-B650-LiveMixer:/media/kurogane/HD-NRLD-A/projects/system_prompts$ python -m vllm.entrypoints.openai.api_server \
--model math_100k/model \
--tokenizer math_100k/model \
--host 0.0.0.0 --port 8000 \
--chat-template math_100k/model/chat_template.jinja
INFO 07-19 13:55:05 [__init__.py:244] Automatically detected platform cuda.
INFO 07-19 13:55:06 [api_server.py:1395] vLLM API server version 0.9.2
INFO 07-19 13:55:06 [cli_args.py:325] non-default args: {'host': '0.0.0.0', 'chat_template': 'math_100k/model/chat_template.jinja', 'model': 'math_100k/model', 'tokenizer': 'math_100k/model'}
INFO 07-19 13:55:08 [config.py:841] This model supports multiple tasks: {'reward', 'classify', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 07-19 13:55:08 [config.py:1472] Using max model len 8192
INFO 07-19 13:55:09 [config.py:2285] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 07-19 13:55:12 [__init__.py:244] Automatically detected platform cuda.
INFO 07-19 13:55:13 [core.py:526] Waiting for init message from front-end.
INFO 07-19 13:55:13 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='math_100k/model', speculative_config=None, tokenizer='math_100k/model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=math_100k/model, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
INFO 07-19 13:55:14 [parallel_state.py:1076] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 07-19 13:55:14 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 07-19 13:55:14 [gpu_model_runner.py:1770] Starting to load model math_100k/model...
INFO 07-19 13:55:14 [gpu_model_runner.py:1775] Loading model from scratch...
INFO 07-19 13:55:14 [cuda.py:284] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 2.03it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 2.71it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 2.58it/s]
INFO 07-19 13:55:15 [default_loader.py:272] Loading weights took 0.81 seconds
INFO 07-19 13:55:15 [gpu_model_runner.py:1801] Model loading took 6.3075 GiB and 0.922393 seconds
INFO 07-19 13:55:18 [backends.py:508] Using cache directory: /home/kurogane/.cache/vllm/torch_compile_cache/da6b69fd1d/rank_0_0/backbone for vLLM's torch.compile
INFO 07-19 13:55:18 [backends.py:519] Dynamo bytecode transform time: 3.50 s
INFO 07-19 13:55:21 [backends.py:155] Directly load the compiled graph(s) for shape None from the cache, took 2.742 s
INFO 07-19 13:55:22 [monitor.py:34] torch.compile takes 3.50 s in total
INFO 07-19 13:55:22 [gpu_worker.py:232] Available KV cache memory: 20.83 GiB
INFO 07-19 13:55:22 [kv_cache_utils.py:716] GPU KV cache size: 136,496 tokens
INFO 07-19 13:55:22 [kv_cache_utils.py:720] Maximum concurrency for 8,192 tokens per request: 16.66x
Capturing CUDA graph shapes: 100%|██████████| 67/67 [00:09<00:00,  6.98it/s]
INFO 07-19 13:55:32 [gpu_model_runner.py:2326] Graph capturing finished in 10 secs, took 0.63 GiB
INFO 07-19 13:55:32 [core.py:172] init engine (profile, create kv cache, warmup model) took 17.10 seconds
INFO 07-19 13:55:34 [loggers.py:137] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 8531
WARNING 07-19 13:55:35 [config.py:1392] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 07-19 13:55:35 [serving_chat.py:125] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_p': 0.9}
INFO 07-19 13:55:35 [serving_completion.py:72] Using default completion sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_p': 0.9}
INFO 07-19 13:55:35 [api_server.py:1457] Starting vLLM API server 0 on http://0.0.0.0:8000
INFO 07-19 13:55:35 [launcher.py:29] Available routes are:
INFO 07-19 13:55:35 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
INFO 07-19 13:55:35 [launcher.py:37] Route: /docs, Methods: HEAD, GET
INFO 07-19 13:55:35 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 07-19 13:55:35 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
INFO 07-19 13:55:35 [launcher.py:37] Route: /health, Methods: GET
INFO 07-19 13:55:35 [launcher.py:37] Route: /load, Methods: GET
INFO 07-19 13:55:35 [launcher.py:37] Route: /ping, Methods: POST
INFO 07-19 13:55:35 [launcher.py:37] Route: /ping, Methods: GET
INFO 07-19 13:55:35 [launcher.py:37] Route: /tokenize, Methods: POST
INFO 07-19 13:55:35 [launcher.py:37] Route: /detokenize, Methods: POST
INFO 07-19 13:55:35 [launcher.py:37] Route: /v1/models, Methods: GET
INFO 07-19 13:55:35 [launcher.py:37] Route: /version, Methods: GET
INFO 07-19 13:55:35 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
INFO 07-19 13:55:35 [launcher.py:37] Route: /v1/completions, Methods: POST
INFO 07-19 13:55:35 [launcher.py:37] Route: /v1/embeddings, Methods: POST
INFO 07-19 13:55:35 [launcher.py:37] Route: /pooling, Methods: POST
INFO 07-19 13:55:35 [launcher.py:37] Route: /classify, Methods: POST
INFO 07-19 13:55:35 [launcher.py:37] Route: /score, Methods: POST
INFO 07-19 13:55:35 [launcher.py:37] Route: /v1/score, Methods: POST
INFO 07-19 13:55:35 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
INFO 07-19 13:55:35 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
INFO 07-19 13:55:35 [launcher.py:37] Route: /rerank, Methods: POST
INFO 07-19 13:55:35 [launcher.py:37] Route: /v1/rerank, Methods: POST
INFO 07-19 13:55:35 [launcher.py:37] Route: /v2/rerank, Methods: POST
INFO 07-19 13:55:35 [launcher.py:37] Route: /invocations, Methods: POST
INFO 07-19 13:55:35 [launcher.py:37] Route: /metrics, Methods: GET
INFO: Started server process [10190]
INFO: Waiting for application startup.
INFO: Application startup complete.