Translated by AI
Running Codex with a Local LLM
Introduction
In this article, I will summarize how to drive the Codex CLI using a local LLM.
Background
One of my hobbies is benchmarking generative AI, and recently I've often been evaluating agent performance when combining LLMs with harnesses like Claude Code or Codex.
Most general-purpose harnesses are OpenAI-compatible, meaning they talk to the Chat Completions API. The exception is Claude Code, which uses Anthropic's Messages API.
llama.cpp, an inference engine I often use for benchmarking, supports both the Chat Completions API and the Messages API, so it can connect to almost every harness. Codex, however, requires a Responses-style API, and that has kept me from running it against llama.cpp.
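To make the difference between the two API styles concrete, here is a minimal sketch of how the same prompt is shaped for each one. The endpoint paths and the top-level fields (`messages` versus `input`) follow OpenAI's public API. The helper names and the model name `my-local-model` are hypothetical placeholders, and nothing is sent over the network.

```python
import json


def chat_completions_request(model, prompt):
    """Build (path, body) for a Chat Completions style call."""
    return "/v1/chat/completions", {
        "model": model,
        # Chat Completions carries the conversation in "messages".
        "messages": [{"role": "user", "content": prompt}],
    }


def responses_request(model, prompt):
    """Build (path, body) for a Responses style call."""
    return "/v1/responses", {
        "model": model,
        # The Responses API carries the prompt in "input" instead.
        "input": prompt,
    }


if __name__ == "__main__":
    # "my-local-model" is a placeholder, not a real identifier.
    for build in (chat_completions_request, responses_request):
        path, body = build("my-local-model", "Hello")
        print(path, json.dumps(body))
```

A backend that only implements the first path cannot answer a client that only speaks the second, which is the gap this article works around.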
Because of this, I had not been able to benchmark Codex in combination with local LLMs. I recently subscribed to ChatGPT and started using the Codex app frequently, which made me want to run those benchmarks and led to this effort.
That introduction ran a bit long, so let's get to the main topic.
LiteLLM Proxy
A proxy such as LiteLLM might be able to convert the API format, but proxies generally risk degrading performance and mean running one more server, which I would prefer to avoid. I therefore treat a proxy server as a last resort and first look for an inference engine that supports the Responses-style API natively.
llama.cpp
To be sure, I first checked whether llama.cpp can actually serve as the LLM backend for Codex.
The short answer is that it did not work well with llama.cpp, so I am hoping for future support. The notes below are for reference.
Although llama.cpp supposedly supports the Responses API, when I actually tried to use it from Codex, the following error occurred.

It seems that if you downgrade the Codex version to 0.87.0 or earlier, it can be used with llama.cpp, but since I want to run benchmark evaluations on the latest version, I decided to give up on running it with llama.cpp for now.
LM Studio
I received a comment on X pointing out that LM Studio natively supports the Responses-style API, so I will give it a try.
First, as a prerequisite, I run LLMs on either an RTX 5090 or a Mac Studio, both of which are remote PCs, and I want to avoid using a GUI as much as possible.
Therefore, the setup below uses lms (LM Studio's CLI) rather than the LM Studio GUI application.
Setting up lms
- Installation
$ curl -fsSL https://lmstudio.ai/install.sh | bash
- Starting the daemon
$ source ~/.bashrc # For Mac: $ source ~/.zshrc
$ lms daemon up
- Version check
$ lms version
__ __ ___ ______ ___ _______ ____
/ / / |/ / / __/ /___ _____/ (_)__ / ___/ / / _/
/ /__/ /|_/ / _\ \/ __/ // / _ / / _ \ / /__/ /___/ /
/____/_/ /_/ /___/\__/\_,_/\_,_/_/\___/ \___/____/___/
lms is LM Studio's CLI utility for your models, server, and inference runtime.
CLI commit: 0b2a176
Docs: https://lmstudio.ai/docs/developer
Join our Discord: https://discord.gg/lmstudio
Contribute: https://github.com/lmstudio-ai/lms
Loading GGUF
Once lms is up and running, the next step is to load the model. Because I wanted to use GGUF models with a workflow as close to llama-server as possible, I had GitHub Copilot + gpt-5.4 write a dedicated shell script for me.
The GGUF file is the same one used with llama.cpp; adjust the paths to match your environment.
The general flow of the shell script is as follows:
- Register the GGUF with LM Studio (use -l for a symlink if you want to keep the original file)
- Load the model
- Start the OpenAI-compatible server
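The three steps above can be sketched as follows. This only assembles the lms command lines and does not run them. The subcommands and flags mirror the ones used in the shell script in this article; the helper name `build_lms_commands` and its defaults are hypothetical, and the repo name is derived by simply dropping the file extension, which is a simplification of what the script does.

```python
import os
import shlex


def build_lms_commands(gguf_path, identifier, user="local", port=8080,
                       context_length=200000, parallel=2):
    """Return the lms invocations for import, load, and server start."""
    file_name = os.path.basename(gguf_path)
    repo = os.path.splitext(file_name)[0]  # file name without extension
    model_key = f"{user}/{repo}"
    return [
        # 1. Register the GGUF; -l symlinks it so the original file stays put.
        ["lms", "import", "-y", "--user-repo", model_key, "-l", gguf_path],
        # 2. Load the model under the identifier that clients will use.
        ["lms", "load", "--exact", f"{model_key}/{file_name}",
         "--identifier", identifier,
         "--gpu", "max",
         "--context-length", str(context_length),
         "--parallel", str(parallel),
         "-y"],
        # 3. Start the OpenAI-compatible server.
        ["lms", "server", "start", "--port", str(port)],
    ]


if __name__ == "__main__":
    commands = build_lms_commands(
        "../models/Qwen3.6-27B-UD-Q4_K_XL.gguf", "Qwen3.6-27B-UD-Q4_K_XL"
    )
    for command in commands:
        print(shlex.join(command))
```

Printing the commands before running them makes it easy to confirm the model key and identifier that will end up in LM Studio.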
Below are the command and the shell script I used to load the ../models/Qwen3.6-27B-UD-Q4_K_XL.gguf model on an RTX 5090 under the identifier Qwen3.6-27B-UD-Q4_K_XL.
- Execution command
$ ./lms-load-and-serve.sh ../models/Qwen3.6-27B-UD-Q4_K_XL.gguf Qwen3.6-27B-UD-Q4_K_XL
USER PID ACCESS COMMAND
8080/tcp: gosrum 2762964 F.... python3
Using model key: local/Qwen3.6-27B-UD-Q4_K_XL
Using identifier: Qwen3.6-27B-UD-Q4_K_XL
Model file already imported: /home/gosrum/.lmstudio/models/local/Qwen3.6-27B-UD-Q4_K_XL/Qwen3.6-27B-UD-Q4_K_XL.gguf
Loading model with context=200000 parallel=2 gpu=max
Model loaded successfully in 2.67s.
(16.40 GiB)
To use the model in the API/SDK, use the identifier "Qwen3.6-27B-UD-Q4_K_XL".
LM Studio backend is already running on port 8081
Reasoning mode: OFF
No-think proxy is serving on 0.0.0.0:8080 -> 127.0.0.1:8081
Ready.
OpenAI-compatible endpoint: http://localhost:8080/v1
Loaded model identifier: Qwen3.6-27B-UD-Q4_K_XL
Example request:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.6-27B-UD-Q4_K_XL",
"messages": [{"role":"user","content":"Hello"}],
"temperature": 0.7,
"top_p": 0.8,
"top_k": 20,
"presence_penalty": 1.5
}'
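For scripted checks, the curl example printed above can be reproduced with Python's standard library. This is a sketch that assumes the server from the log output is reachable at http://localhost:8080/v1. `build_chat_request` and `ask` are hypothetical helpers, and the reply is read using the usual Chat Completions shape (`choices[0].message.content`), which is an assumption about what the backend returns.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8080/v1"  # endpoint printed by the script


def build_chat_request(model, prompt, base_url=BASE_URL):
    """Build the same POST request as the curl example, without sending it."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "presence_penalty": 1.5,
    }
    return Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model, prompt):
    """Send the request and return the assistant's text (needs a live server)."""
    with urlopen(build_chat_request(model, prompt), timeout=300) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `ask("Qwen3.6-27B-UD-Q4_K_XL", "Hello")` should return the model's reply as a string.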
- Shell script
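#!/usr/bin/env bash
set -euo pipefail
# Free the public port before starting. fuser exits non-zero when nothing is
# using the port, so "|| true" keeps "set -e" from aborting on a clean first run.
fuser -kvn tcp "${LMS_PORT:-8080}" || true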

usage() {
  cat <<'EOF'
Usage:
  ./lms-load-and-serve.sh <gguf-path> <identifier>
Example:
  ./lms-load-and-serve.sh ../models/Qwen3.6-27B-UD-Q4_K_XL.gguf Qwen3.6-27B-UD-Q4_K_XL
  LMS_DISABLE_THINKING=0 ./lms-load-and-serve.sh ../models/Qwen3.6-27B-UD-Q4_K_XL.gguf Qwen3.6-27B-UD-Q4_K_XL
Environment variables:
  MODEL_USER=local              Import destination user name
  MODEL_REPO=<gguf-name>        Import destination repo name
  LMS_IMPORT_MODE=symlink       symlink | hard-link | copy | move
  LMS_BIND=0.0.0.0              Public bind address
  LMS_PORT=8080                 Public API port
  LMS_DISABLE_THINKING=1        1 = reasoning off, 0 = reasoning on
  LMS_BACKEND_PORT=8081         Internal LM Studio port when proxy is enabled
  LMS_GPU=max                   lms load --gpu value
  LMS_CONTEXT_LENGTH=200000     lms load --context-length value
  LMS_PARALLEL=2                lms load --parallel value
Notes:
  - Re-running the script reuses an already imported model key.
  - The model key defaults to local/<gguf file name without extension>.
  - If the same identifier is already loaded, it is unloaded before reloading.
  - Default is reasoning off.
  - To enable reasoning, run with LMS_DISABLE_THINKING=0.
  - With LMS_DISABLE_THINKING=1, the public endpoint stays on LMS_PORT and a
    local proxy injects an empty assistant <think> block into chat requests.
EOF
}

die() {
  printf 'Error: %s\n' "$*" >&2
  exit 1
}

resolve_lmstudio_home() {
  printf '%s\n' "${LMSTUDIO_HOME:-$HOME/.lmstudio}"
}

resolve_lms_bin() {
  local lmstudio_home
  lmstudio_home="$(resolve_lmstudio_home)"
  if [[ -x "${lmstudio_home}/bin/lms" ]]; then
    printf '%s\n' "${lmstudio_home}/bin/lms"
    return
  fi
  if command -v lms >/dev/null 2>&1; then
    command -v lms
    return
  fi
  die "lms is not in PATH and ${lmstudio_home}/bin/lms does not exist"
}

resolve_models_folder() {
  local lmstudio_home settings_json downloads_folder
  lmstudio_home="$(resolve_lmstudio_home)"
  settings_json="${lmstudio_home}/settings.json"
  if [[ -f "$settings_json" ]]; then
    downloads_folder="$(
      sed -n 's/.*"downloadsFolder"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$settings_json" \
        | head -n 1
    )"
    if [[ -n "$downloads_folder" ]]; then
      printf '%s\n' "$downloads_folder"
      return
    fi
  fi
  printf '%s\n' "${lmstudio_home}/models"
}

python_bin() {
  if command -v python3 >/dev/null 2>&1; then
    printf '%s\n' python3
  elif command -v python >/dev/null 2>&1; then
    printf '%s\n' python
  else
    die "python3 or python is required"
  fi
}

auto_name_repo() {
  local file_name=$1
  sed -E 's/(\.Q[^.]{1,5})?\.[^.]+$//' <<<"$file_name"
}

json_field() {
  local json=$1
  local field=$2
  sed -n "s/.*\"${field}\":\\([^,}]*\\).*/\\1/p" <<<"$json" | tr -d '"'
}

wait_for_url() {
  local url=$1
  local label=$2
  local attempt
  for attempt in $(seq 1 40); do
    if curl -fsS -o /dev/null "$url" >/dev/null 2>&1; then
      return 0
    fi
    sleep 0.5
  done
  die "${label} did not become ready: ${url}"
}

stop_pid_file() {
  local pid_file=$1
  local label=$2
  local pid
  if [[ ! -f "$pid_file" ]]; then
    return 0
  fi
  pid="$(cat "$pid_file" 2>/dev/null || true)"
  if [[ -n "$pid" ]] && kill -0 "$pid" >/dev/null 2>&1; then
    printf 'Stopping %s (PID %s)\n' "$label" "$pid"
    kill "$pid"
  fi
  rm -f "$pid_file"
}

write_no_think_proxy() {
  local proxy_script=$1
  mkdir -p "$(dirname "$proxy_script")"
  cat >"$proxy_script" <<'PY'
#!/usr/bin/env python3
import json
import sys
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

listen_host = sys.argv[1]
listen_port = int(sys.argv[2])
backend_port = int(sys.argv[3])
backend_base = f"http://127.0.0.1:{backend_port}"


def ensure_no_think(payload):
    # Append an empty <think> block as an assistant turn unless the request
    # already ends with an assistant message.
    if not isinstance(payload, dict):
        return payload
    messages = payload.get("messages")
    if isinstance(messages, list):
        if len(messages) > 0 and isinstance(messages[-1], dict) and messages[-1].get("role") == "assistant":
            return payload
        messages.append({"role": "assistant", "content": "<think>\n\n</think>\n\n"})
    return payload


class ProxyHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def log_message(self, format, *args):
        return

    def do_GET(self):
        self.forward()

    def do_POST(self):
        self.forward()

    def do_DELETE(self):
        self.forward()

    def do_OPTIONS(self):
        self.forward()

    def do_PATCH(self):
        self.forward()

    def forward(self):
        body = None
        content_length = int(self.headers.get("Content-Length", "0") or "0")
        if content_length > 0:
            body = self.rfile.read(content_length)
        headers = {key: value for key, value in self.headers.items() if key.lower() != "host"}
        # self.headers is case-insensitive; the copied dict above is not, so
        # look the content type up on the original to catch "content-type" too.
        content_type = self.headers.get("Content-Type", "")
        # Only chat completion requests are rewritten; every other path is
        # forwarded unchanged.
        if (
            self.command == "POST"
            and body is not None
            and "application/json" in content_type
            and self.path == "/v1/chat/completions"
        ):
            try:
                payload = json.loads(body.decode("utf-8"))
                payload = ensure_no_think(payload)
                body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
                headers["Content-Length"] = str(len(body))
            except Exception:
                pass
        request = Request(f"{backend_base}{self.path}", data=body, headers=headers, method=self.command)
        try:
            # The whole backend response is buffered before it is relayed, so
            # streamed replies reach the client in one piece.
            with urlopen(request, timeout=300) as response:
                response_body = response.read()
                self.send_response(response.status)
                for key, value in response.headers.items():
                    if key.lower() in {"transfer-encoding", "connection", "content-length"}:
                        continue
                    self.send_header(key, value)
                self.send_header("Content-Length", str(len(response_body)))
                self.end_headers()
                self.wfile.write(response_body)
        except HTTPError as error:
            response_body = error.read()
            self.send_response(error.code)
            for key, value in error.headers.items():
                if key.lower() in {"transfer-encoding", "connection", "content-length"}:
                    continue
                self.send_header(key, value)
            self.send_header("Content-Length", str(len(response_body)))
            self.end_headers()
            self.wfile.write(response_body)
        except URLError as error:
            response_body = str(error).encode("utf-8")
            self.send_response(502)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(response_body)))
            self.end_headers()
            self.wfile.write(response_body)


if __name__ == "__main__":
    server = ThreadingHTTPServer((listen_host, listen_port), ProxyHandler)
    server.serve_forever()
PY
  chmod +x "$proxy_script"
}

start_no_think_proxy() {
  local bind_address=$1
  local public_port=$2
  local backend_port=$3
  local lmstudio_home proxy_script proxy_pid_file proxy_log py
  lmstudio_home="$(resolve_lmstudio_home)"
  proxy_script="${lmstudio_home}/.internal/lms-no-think-proxy.py"
  proxy_pid_file="${lmstudio_home}/.internal/lms-no-think-proxy.pid"
  proxy_log="${lmstudio_home}/.internal/lms-no-think-proxy.log"
  py="$(python_bin)"
  write_no_think_proxy "$proxy_script"
  stop_pid_file "$proxy_pid_file" "no-think proxy"
  nohup "$py" "$proxy_script" "$bind_address" "$public_port" "$backend_port" \
    >"$proxy_log" 2>&1 &
  echo "$!" >"$proxy_pid_file"
  wait_for_url "http://127.0.0.1:${public_port}/v1/models" "no-think proxy"
}

if [[ ${1:-} == "-h" || ${1:-} == "--help" ]]; then
  usage
  exit 0
fi
if [[ $# -ne 2 ]]; then
  usage >&2
  exit 1
fi

gguf_input=$1
identifier=$2
[[ -f "$gguf_input" ]] || die "model file not found: $gguf_input"
lms_bin="$(resolve_lms_bin)"
gguf_path="$(cd "$(dirname "$gguf_input")" && pwd -P)/$(basename "$gguf_input")"
gguf_file_name="$(basename "$gguf_path")"
default_model_repo="$(auto_name_repo "$gguf_file_name")"
model_user="${MODEL_USER:-local}"
model_repo="${MODEL_REPO:-$default_model_repo}"
model_key="${model_user}/${model_repo}"
models_folder="$(resolve_models_folder)"
target_path="${models_folder}/${model_user}/${model_repo}/${gguf_file_name}"
exact_model_path="${model_key}/${gguf_file_name}"
bind_address="${LMS_BIND:-0.0.0.0}"
public_port="${LMS_PORT:-8080}"
disable_thinking="${LMS_DISABLE_THINKING:-1}"
gpu_ratio="${LMS_GPU:-max}"
context_length="${LMS_CONTEXT_LENGTH:-200000}"
parallel_count="${LMS_PARALLEL:-2}"
import_mode="${LMS_IMPORT_MODE:-symlink}"

# With reasoning off, LM Studio listens on an internal port and the proxy
# takes the public one; otherwise LM Studio binds the public port directly.
if [[ "$disable_thinking" != "0" ]]; then
  backend_port="${LMS_BACKEND_PORT:-8081}"
  backend_bind="127.0.0.1"
else
  backend_port="$public_port"
  backend_bind="$bind_address"
fi

case "$import_mode" in
  symlink)
    import_flag="-l"
    ;;
  hard-link)
    import_flag="-L"
    ;;
  copy)
    import_flag="-c"
    ;;
  move)
    import_flag=""
    ;;
  *)
    die "unsupported LMS_IMPORT_MODE: $import_mode"
    ;;
esac

printf 'Using model key: %s\n' "$model_key"
printf 'Using identifier: %s\n' "$identifier"

# Step 1: import the GGUF unless it is already registered.
downloaded_models_json="$("$lms_bin" ls --json)"
if [[ -e "$target_path" ]]; then
  printf 'Model file already imported: %s\n' "$target_path"
elif grep -F "\"modelKey\":\"$model_key\"" <<<"$downloaded_models_json" >/dev/null; then
  printf 'Model already imported: %s\n' "$model_key"
else
  import_cmd=("$lms_bin" import -y --user-repo "$model_key")
  if [[ -n "$import_flag" ]]; then
    import_cmd+=("$import_flag")
  fi
  import_cmd+=("$gguf_path")
  printf 'Importing model from: %s\n' "$gguf_path"
  "${import_cmd[@]}"
fi

# Step 2: (re)load the model under the requested identifier.
loaded_models_json="$("$lms_bin" ps --json)"
if grep -F "\"identifier\":\"$identifier\"" <<<"$loaded_models_json" >/dev/null; then
  printf 'Unloading existing model instance: %s\n' "$identifier"
  "$lms_bin" unload "$identifier"
fi
load_cmd=(
  "$lms_bin" load --exact "$exact_model_path"
  --identifier "$identifier"
  --gpu "$gpu_ratio"
  --context-length "$context_length"
  --parallel "$parallel_count"
  -y
)
printf 'Loading model with context=%s parallel=%s gpu=%s\n' \
  "$context_length" "$parallel_count" "$gpu_ratio"
"${load_cmd[@]}"

# Step 3: make sure the LM Studio server runs on the expected port.
server_status_json="$("$lms_bin" server status --json 2>/dev/null || true)"
server_running="$(json_field "$server_status_json" "running")"
current_port="$(json_field "$server_status_json" "port")"
if [[ "$server_running" == "true" ]]; then
  if [[ "$current_port" == "$backend_port" ]]; then
    printf 'LM Studio backend is already running on port %s\n' "$current_port"
  else
    printf 'Restarting LM Studio backend from port %s to %s\n' \
      "${current_port:-unknown}" "$backend_port"
    "$lms_bin" server stop
    server_running="false"
  fi
fi
if [[ "$server_running" != "true" ]]; then
  start_cmd=("$lms_bin" server start --bind "$backend_bind" --port "$backend_port")
  printf 'Starting LM Studio backend on %s:%s\n' "$backend_bind" "$backend_port"
  "${start_cmd[@]}"
fi
wait_for_url "http://127.0.0.1:${backend_port}/v1/models" "LM Studio backend"

if [[ "$disable_thinking" != "0" ]]; then
  start_no_think_proxy "$bind_address" "$public_port" "$backend_port"
  printf 'Reasoning mode: OFF\n'
  printf 'No-think proxy is serving on %s:%s -> 127.0.0.1:%s\n' \
    "$bind_address" "$public_port" "$backend_port"
else
  stop_pid_file "$(resolve_lmstudio_home)/.internal/lms-no-think-proxy.pid" "no-think proxy"
  printf 'Reasoning mode: ON\n'
fi

cat <<EOF
Ready.
OpenAI-compatible endpoint: http://localhost:${public_port}/v1
Loaded model identifier: ${identifier}
Example request:
  curl http://localhost:${public_port}/v1/chat/completions \\
    -H "Content-Type: application/json" \\
    -d '{
      "model": "${identifier}",
      "messages": [{"role":"user","content":"Hello"}],
      "temperature": 0.7,
      "top_p": 0.8,
      "top_k": 20,
      "presence_penalty": 1.5
    }'
EOF
With this, the lms inference server is up and running.
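The no-think behavior of the proxy in the script can be examined in isolation. The sketch below copies the injection logic into a standalone function (renamed `inject_empty_think`, a hypothetical name) that works on a copy of the request, so you can see what would be forwarded to LM Studio when LMS_DISABLE_THINKING=1. Whether an empty <think> block actually suppresses reasoning depends on the model's chat template, which is an assumption here.

```python
import copy

EMPTY_THINK = "<think>\n\n</think>\n\n"


def inject_empty_think(payload):
    """Return a copy of a chat request with an empty <think> turn appended."""
    if not isinstance(payload, dict):
        return payload
    result = copy.deepcopy(payload)  # leave the caller's request untouched
    messages = result.get("messages")
    if isinstance(messages, list):
        last = messages[-1] if messages else None
        # A request that already ends with an assistant turn is left alone.
        if isinstance(last, dict) and last.get("role") == "assistant":
            return result
        messages.append({"role": "assistant", "content": EMPTY_THINK})
    return result


if __name__ == "__main__":
    request = {
        "model": "Qwen3.6-27B-UD-Q4_K_XL",
        "messages": [{"role": "user", "content": "Hello"}],
    }
    print(inject_empty_think(request)["messages"])
```

Note that a body without a `messages` list passes through unchanged, and the proxy in the script only rewrites POST requests to /v1/chat/completions, so requests to other paths are forwarded as they are.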
- Unloading a model
To unload a model, simply execute the following command:
$ lms unload --all
Calling the lms server's local LLM from Codex CLI
Here is how to call the local LLM API described above from the Codex CLI.
- Codex installation
$ npm i -g @openai/codex@latest
- Editing the config file
Add the following to ~/.codex/config.toml:
$ code ~/.codex/config.toml
[model_providers.lms]
name = "lms API"
base_url = "http://localhost:8080/v1"
wire_api = "responses"
stream_idle_timeout_ms = 10000000
Setting wire_api = "responses" makes Codex call the LM Studio endpoint using the Responses API.
Also, the model name must match the identifier used when loading in lms. In this example, since it was loaded as Qwen3.6-27B-UD-Q4_K_XL, you should specify the same name on the Codex side.
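Since a mismatched model name is an easy mistake to make, it may help to check which identifiers the server reports before launching Codex. This is a sketch that assumes /v1/models (the same URL the shell script polls) returns the usual OpenAI-style list of the form {"data": [{"id": ...}]}. All function names are hypothetical.

```python
import json
from urllib.request import urlopen


def served_identifiers(models_response):
    """Extract model ids from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", []) if "id" in entry]


def check_identifier(identifier, models_response):
    """True when the name passed to `codex --model` is actually being served."""
    return identifier in served_identifiers(models_response)


def fetch_models(base_url="http://localhost:8080/v1"):
    """Query the running server (needs the lms server to be up)."""
    with urlopen(f"{base_url}/models", timeout=10) as response:
        return json.load(response)
```

With the server running, `check_identifier("Qwen3.6-27B-UD-Q4_K_XL", fetch_models())` tells you whether the name you are about to pass to Codex is one the server knows.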
- Execution
$ codex --model Qwen3.6-27B-UD-Q4_K_XL -c model_provider=lms --search --dangerously-bypass-approvals-and-sandbox
With this, you can now drive Codex with a local LLM.
Thoughts after trying it out
After actually running it, I confirmed that the Codex harness itself works without issues even with a local LLM.
On the other hand, practicality depends heavily on model quality and speed. Because Codex repeatedly makes tool calls and checks diffs, wait times get quite long with models that are slow at inference.
Also, for a coding agent, what matters is not just chat quality but also instruction following, tool use, and stability over long contexts. Getting it to run does not mean you immediately get the same experience as with a commercial model.
That said, being able to run Codex with a local LLM makes it easier to compare agent performance between models on the same harness. Personally, I feel this is the biggest benefit.
Summary
In this article, I summarized the methods for driving the Codex CLI with a local LLM, including:
- Results from trying it with llama.cpp
- How to load GGUF models using the LM Studio CLI
- How to set up an LM Studio OpenAI-compatible endpoint
- Configuration of ~/.codex/config.toml on the Codex side
- How to call a local LLM from the Codex CLI
At present, using llama.cpp alone did not go well, but I confirmed that I could call a local LLM from Codex by using LM Studio.
Going forward, I would like to run benchmark evaluations of Codex combined with local LLMs.
Thank you for reading to the end. I will continue to share interesting usage methods and convenient tricks on X or in articles when I find them.