Running Codex with a Local LLM

Introduction

In this article, I will summarize how to drive the Codex CLI using a local LLM.

Background

One of my hobbies is benchmarking generative AI, and recently I've often been evaluating agent performance when combining LLMs with harnesses like Claude Code or Codex.

Most general-purpose harnesses are OpenAI-compatible, meaning they talk to the Chat Completions API. The exception is Claude Code, which uses Anthropic's Messages API.

llama.cpp, the inference engine I often use for benchmarking, supports both the Chat Completions API and the Messages API, so it can connect to almost every harness. However, it could not run Codex, because Codex requires a Responses-style API.
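To make the difference between the three API styles concrete, here is a minimal Python sketch that builds the request body for the same prompt in each style. The paths and field names follow the public OpenAI and Anthropic API shapes as I understand them; the function name build_requests, the model name local-model, and the max_tokens value are placeholders of mine, not something from the original setup.

```python
import json


def build_requests(model: str, prompt: str) -> dict:
    """Return (path, JSON body) pairs for one prompt in three API styles."""
    user_message = {"role": "user", "content": prompt}
    return {
        # OpenAI Chat Completions: a list of role-tagged messages.
        "chat_completions": (
            "/v1/chat/completions",
            {"model": model, "messages": [user_message]},
        ),
        # Anthropic Messages: also a messages list, but max_tokens is required.
        "messages": (
            "/v1/messages",
            {"model": model, "max_tokens": 256, "messages": [user_message]},
        ),
        # OpenAI Responses: the conversation goes in "input", not "messages".
        "responses": (
            "/v1/responses",
            {"model": model, "input": prompt},
        ),
    }


if __name__ == "__main__":
    for name, (path, body) in build_requests("local-model", "Hello").items():
        print(f"{name}: POST {path}")
        print(json.dumps(body, indent=2))
```

A server that only implements the first two shapes has no handler for the third path, which is why a harness built on the Responses API needs native support or a translating proxy.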

Because of this, I had not been able to benchmark Codex in combination with local LLMs. I recently subscribed to ChatGPT and started using the Codex App frequently, which made me want to run those benchmarks and led to this attempt.

That was a long introduction, so let's get to the main topic.

LiteLLM Proxy

It may be possible to convert the API format with a proxy such as LiteLLM, but a proxy risks degrading performance and means running one more server, which I would rather avoid. I therefore treat a proxy server as a last resort and first look for an inference engine that supports the API natively.

llama.cpp

To be sure, I first checked whether llama.cpp can actually serve as the LLM backend for Codex.

In short, it did not work well with llama.cpp, so I am hoping for future support. The notes below are for reference.

Although llama.cpp supposedly supports the Responses API, an error occurred when I actually tried to use it from Codex.

Apparently it works with llama.cpp if you downgrade Codex to 0.87.0 or earlier, but since I want to benchmark the latest version, I gave up on llama.cpp for now.

https://unsloth.ai/docs/jp/ji-ben/codex

https://zenn.dev/edna_startup/scraps/e5f7e294b2ede3

LM Studio

I received a comment on X pointing out that LM Studio natively supports a Responses-style API, so I gave it a try.

https://x.com/kis/status/2053300771903750173

As a prerequisite: I run LLMs on either an RTX 5090 or a Mac Studio, both of which are remote machines, so I want to avoid GUIs as much as possible.

I therefore describe the setup using lms (LM Studio's CLI) rather than the LM Studio GUI application.

Setting up lms

  • Installation
$ curl -fsSL https://lmstudio.ai/install.sh | bash
  • Starting the daemon
$ source ~/.bashrc # For Mac: $ source ~/.zshrc
$ lms daemon up
  • Version check
$ lms version
   __   __  ___  ______          ___        _______   ____
  / /  /  |/  / / __/ /___ _____/ (_)__    / ___/ /  /  _/
 / /__/ /|_/ / _\ \/ __/ // / _  / / _ \  / /__/ /___/ /
/____/_/  /_/ /___/\__/\_,_/\_,_/_/\___/  \___/____/___/

lms is LM Studio's CLI utility for your models, server, and inference runtime.
CLI commit: 0b2a176

Docs: https://lmstudio.ai/docs/developer
Join our Discord: https://discord.gg/lmstudio
Contribute: https://github.com/lmstudio-ai/lms

Loading GGUF

Once lms is up and running, the next step is to load a model. Since I wanted to use GGUF models with a workflow as close as possible to llama-server, I had GitHub Copilot + gpt-5.4 write a dedicated shell script.

The GGUF file is exactly the same one used with llama.cpp; replace the paths to match your environment.

The general flow of the shell script is as follows:

  1. Register the GGUF with LM Studio (importing with -l creates a symlink, which leaves the original file in place)

  2. Load

  3. Start the OpenAI-compatible server
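The three steps above can be sketched as the lms commands they boil down to. This minimal Python sketch only builds and prints the invocations; it does not run them. The subcommands and flags are the ones used in this article's lms-load-and-serve.sh, and I have not checked them against the lms documentation beyond that. The function name lms_commands and its default values are illustrative choices of mine.

```python
import shlex


def lms_commands(gguf_path, identifier, model_key, bind="0.0.0.0", port=8080,
                 context_length=200000, parallel=2, gpu="max"):
    """Build the import, load, and server-start commands as argv lists."""
    gguf_name = gguf_path.rsplit("/", 1)[-1]
    return [
        # 1. Register the GGUF; -l imports it as a symlink.
        ["lms", "import", "-y", "--user-repo", model_key, "-l", gguf_path],
        # 2. Load it under the identifier that API clients will use.
        ["lms", "load", "--exact", f"{model_key}/{gguf_name}",
         "--identifier", identifier,
         "--gpu", gpu,
         "--context-length", str(context_length),
         "--parallel", str(parallel),
         "-y"],
        # 3. Start the OpenAI-compatible server.
        ["lms", "server", "start", "--bind", bind, "--port", str(port)],
    ]


if __name__ == "__main__":
    commands = lms_commands(
        "../models/Qwen3.6-27B-UD-Q4_K_XL.gguf",
        "Qwen3.6-27B-UD-Q4_K_XL",
        "local/Qwen3.6-27B-UD-Q4_K_XL",
    )
    for argv in commands:
        print(shlex.join(argv))
```

Keeping each command as an argv list rather than a single string avoids quoting problems if a model path contains spaces.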

Below are the command and the shell script I used to load the ../models/Qwen3.6-27B-UD-Q4_K_XL.gguf model on an RTX 5090 under the identifier (alias) Qwen3.6-27B-UD-Q4_K_XL.

  • Execution command
$ ./lms-load-and-serve.sh ../models/Qwen3.6-27B-UD-Q4_K_XL.gguf Qwen3.6-27B-UD-Q4_K_XL
                     USER        PID ACCESS COMMAND
8080/tcp:            gosrum    2762964 F.... python3
Using model key: local/Qwen3.6-27B-UD-Q4_K_XL
Using identifier: Qwen3.6-27B-UD-Q4_K_XL
Model file already imported: /home/gosrum/.lmstudio/models/local/Qwen3.6-27B-UD-Q4_K_XL/Qwen3.6-27B-UD-Q4_K_XL.gguf
Loading model with context=200000 parallel=2 gpu=max
Model loaded successfully in 2.67s.
(16.40 GiB)
To use the model in the API/SDK, use the identifier "Qwen3.6-27B-UD-Q4_K_XL".
LM Studio backend is already running on port 8081
Reasoning mode: OFF
No-think proxy is serving on 0.0.0.0:8080 -> 127.0.0.1:8081
Ready.
OpenAI-compatible endpoint: http://localhost:8080/v1
Loaded model identifier: Qwen3.6-27B-UD-Q4_K_XL

Example request:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-27B-UD-Q4_K_XL",
    "messages": [{"role":"user","content":"Hello"}],
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "presence_penalty": 1.5
  }'
  • Shell script
lms-load-and-serve.sh
#!/usr/bin/env bash

set -euo pipefail
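# Free the public port; "|| true" keeps set -e from aborting when nothing is listening.
fuser -kvn tcp "${LMS_PORT:-8080}" || true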

usage() {
  cat <<'EOF'
Usage:
  ./lms-load-and-serve.sh <gguf-path> <identifier>

Example:
  ./lms-load-and-serve.sh ../models/Qwen3.6-27B-UD-Q4_K_XL.gguf Qwen3.6-27B-UD-Q4_K_XL
  LMS_DISABLE_THINKING=0 ./lms-load-and-serve.sh ../models/Qwen3.6-27B-UD-Q4_K_XL.gguf Qwen3.6-27B-UD-Q4_K_XL

Environment variables:
  MODEL_USER=local                Import destination user name
  MODEL_REPO=<gguf-name>          Import destination repo name
  LMS_IMPORT_MODE=symlink         symlink | hard-link | copy | move
  LMS_BIND=0.0.0.0                Public bind address
  LMS_PORT=8080                   Public API port
  LMS_DISABLE_THINKING=1          1 = reasoning off, 0 = reasoning on
  LMS_BACKEND_PORT=8081           Internal LM Studio port when proxy is enabled
  LMS_GPU=max                     lms load --gpu value
  LMS_CONTEXT_LENGTH=200000       lms load --context-length value
  LMS_PARALLEL=2                  lms load --parallel value

Notes:
  - Re-running the script reuses an already imported model key.
  - The model key defaults to local/<gguf file name without extension>.
  - If the same identifier is already loaded, it is unloaded before reloading.
  - Default is reasoning off.
  - To enable reasoning, run with LMS_DISABLE_THINKING=0.
  - With LMS_DISABLE_THINKING=1, the public endpoint stays on LMS_PORT and a
    local proxy injects an empty assistant <think> block into chat requests.
EOF
}

die() {
  printf 'Error: %s\n' "$*" >&2
  exit 1
}

resolve_lmstudio_home() {
  printf '%s\n' "${LMSTUDIO_HOME:-$HOME/.lmstudio}"
}

resolve_lms_bin() {
  local lmstudio_home

  lmstudio_home="$(resolve_lmstudio_home)"
  if [[ -x "${lmstudio_home}/bin/lms" ]]; then
    printf '%s\n' "${lmstudio_home}/bin/lms"
    return
  fi
  if command -v lms >/dev/null 2>&1; then
    command -v lms
    return
  fi
  die "lms is not in PATH and ${lmstudio_home}/bin/lms does not exist"
}

resolve_models_folder() {
  local lmstudio_home settings_json downloads_folder

  lmstudio_home="$(resolve_lmstudio_home)"
  settings_json="${lmstudio_home}/settings.json"

  if [[ -f "$settings_json" ]]; then
    downloads_folder="$(
      sed -n 's/.*"downloadsFolder"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$settings_json" \
        | head -n 1
    )"
    if [[ -n "$downloads_folder" ]]; then
      printf '%s\n' "$downloads_folder"
      return
    fi
  fi

  printf '%s\n' "${lmstudio_home}/models"
}

python_bin() {
  if command -v python3 >/dev/null 2>&1; then
    printf '%s\n' python3
  elif command -v python >/dev/null 2>&1; then
    printf '%s\n' python
  else
    die "python3 or python is required"
  fi
}

auto_name_repo() {
  local file_name=$1
  sed -E 's/(\.Q[^.]{1,5})?\.[^.]+$//' <<<"$file_name"
}

json_field() {
  local json=$1
  local field=$2
  sed -n "s/.*\"${field}\":\\([^,}]*\\).*/\\1/p" <<<"$json" | tr -d '"'
}

wait_for_url() {
  local url=$1
  local label=$2
  local attempt

  for attempt in $(seq 1 40); do
    if curl -fsS -o /dev/null "$url" >/dev/null 2>&1; then
      return 0
    fi
    sleep 0.5
  done

  die "${label} did not become ready: ${url}"
}

stop_pid_file() {
  local pid_file=$1
  local label=$2
  local pid

  if [[ ! -f "$pid_file" ]]; then
    return 0
  fi

  pid="$(cat "$pid_file" 2>/dev/null || true)"
  if [[ -n "$pid" ]] && kill -0 "$pid" >/dev/null 2>&1; then
    printf 'Stopping %s (PID %s)\n' "$label" "$pid"
    kill "$pid"
  fi
  rm -f "$pid_file"
}

write_no_think_proxy() {
  local proxy_script=$1

  mkdir -p "$(dirname "$proxy_script")"
  cat >"$proxy_script" <<'PY'
#!/usr/bin/env python3
import json
import sys
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

listen_host = sys.argv[1]
listen_port = int(sys.argv[2])
backend_port = int(sys.argv[3])
backend_base = f"http://127.0.0.1:{backend_port}"


def ensure_no_think(payload):
    if not isinstance(payload, dict):
        return payload

    messages = payload.get("messages")
    if isinstance(messages, list):
        if len(messages) > 0 and isinstance(messages[-1], dict) and messages[-1].get("role") == "assistant":
            return payload
        messages.append({"role": "assistant", "content": "<think>\n\n</think>\n\n"})

    return payload


class ProxyHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def log_message(self, format, *args):
        return

    def do_GET(self):
        self.forward()

    def do_POST(self):
        self.forward()

    def do_DELETE(self):
        self.forward()

    def do_OPTIONS(self):
        self.forward()

    def do_PATCH(self):
        self.forward()

    def forward(self):
        body = None
        content_length = int(self.headers.get("Content-Length", "0") or "0")
        if content_length > 0:
            body = self.rfile.read(content_length)

        headers = {key: value for key, value in self.headers.items() if key.lower() != "host"}
        # self.headers is case-insensitive; the plain dict above is not.
        content_type = self.headers.get("Content-Type", "")
        if (
            self.command == "POST"
            and body is not None
            and "application/json" in content_type
            and self.path == "/v1/chat/completions"
        ):
            try:
                payload = json.loads(body.decode("utf-8"))
                payload = ensure_no_think(payload)
                body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
                headers["Content-Length"] = str(len(body))
            except Exception:
                pass

        request = Request(f"{backend_base}{self.path}", data=body, headers=headers, method=self.command)

        try:
            with urlopen(request, timeout=300) as response:
                response_body = response.read()
                self.send_response(response.status)
                for key, value in response.headers.items():
                    if key.lower() in {"transfer-encoding", "connection", "content-length"}:
                        continue
                    self.send_header(key, value)
                self.send_header("Content-Length", str(len(response_body)))
                self.end_headers()
                self.wfile.write(response_body)
        except HTTPError as error:
            response_body = error.read()
            self.send_response(error.code)
            for key, value in error.headers.items():
                if key.lower() in {"transfer-encoding", "connection", "content-length"}:
                    continue
                self.send_header(key, value)
            self.send_header("Content-Length", str(len(response_body)))
            self.end_headers()
            self.wfile.write(response_body)
        except URLError as error:
            response_body = str(error).encode("utf-8")
            self.send_response(502)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(response_body)))
            self.end_headers()
            self.wfile.write(response_body)


if __name__ == "__main__":
    server = ThreadingHTTPServer((listen_host, listen_port), ProxyHandler)
    server.serve_forever()
PY
  chmod +x "$proxy_script"
}

start_no_think_proxy() {
  local bind_address=$1
  local public_port=$2
  local backend_port=$3
  local lmstudio_home proxy_script proxy_pid_file proxy_log py

  lmstudio_home="$(resolve_lmstudio_home)"
  proxy_script="${lmstudio_home}/.internal/lms-no-think-proxy.py"
  proxy_pid_file="${lmstudio_home}/.internal/lms-no-think-proxy.pid"
  proxy_log="${lmstudio_home}/.internal/lms-no-think-proxy.log"
  py="$(python_bin)"

  write_no_think_proxy "$proxy_script"
  stop_pid_file "$proxy_pid_file" "no-think proxy"

  nohup "$py" "$proxy_script" "$bind_address" "$public_port" "$backend_port" \
    >"$proxy_log" 2>&1 &
  echo "$!" >"$proxy_pid_file"

  wait_for_url "http://127.0.0.1:${public_port}/v1/models" "no-think proxy"
}

if [[ ${1:-} == "-h" || ${1:-} == "--help" ]]; then
  usage
  exit 0
fi

if [[ $# -ne 2 ]]; then
  usage >&2
  exit 1
fi

gguf_input=$1
identifier=$2

[[ -f "$gguf_input" ]] || die "model file not found: $gguf_input"

lms_bin="$(resolve_lms_bin)"
gguf_path="$(cd "$(dirname "$gguf_input")" && pwd -P)/$(basename "$gguf_input")"
gguf_file_name="$(basename "$gguf_path")"
default_model_repo="$(auto_name_repo "$gguf_file_name")"

model_user="${MODEL_USER:-local}"
model_repo="${MODEL_REPO:-$default_model_repo}"
model_key="${model_user}/${model_repo}"
models_folder="$(resolve_models_folder)"
target_path="${models_folder}/${model_user}/${model_repo}/${gguf_file_name}"
exact_model_path="${model_key}/${gguf_file_name}"

bind_address="${LMS_BIND:-0.0.0.0}"
public_port="${LMS_PORT:-8080}"
disable_thinking="${LMS_DISABLE_THINKING:-1}"
gpu_ratio="${LMS_GPU:-max}"
context_length="${LMS_CONTEXT_LENGTH:-200000}"
parallel_count="${LMS_PARALLEL:-2}"
import_mode="${LMS_IMPORT_MODE:-symlink}"

if [[ "$disable_thinking" != "0" ]]; then
  backend_port="${LMS_BACKEND_PORT:-8081}"
  backend_bind="127.0.0.1"
else
  backend_port="$public_port"
  backend_bind="$bind_address"
fi

case "$import_mode" in
  symlink)
    import_flag="-l"
    ;;
  hard-link)
    import_flag="-L"
    ;;
  copy)
    import_flag="-c"
    ;;
  move)
    import_flag=""
    ;;
  *)
    die "unsupported LMS_IMPORT_MODE: $import_mode"
    ;;
esac

printf 'Using model key: %s\n' "$model_key"
printf 'Using identifier: %s\n' "$identifier"

downloaded_models_json="$("$lms_bin" ls --json)"
if [[ -e "$target_path" ]]; then
  printf 'Model file already imported: %s\n' "$target_path"
elif grep -F "\"modelKey\":\"$model_key\"" <<<"$downloaded_models_json" >/dev/null; then
  printf 'Model already imported: %s\n' "$model_key"
else
  import_cmd=("$lms_bin" import -y --user-repo "$model_key")
  if [[ -n "$import_flag" ]]; then
    import_cmd+=("$import_flag")
  fi
  import_cmd+=("$gguf_path")
  printf 'Importing model from: %s\n' "$gguf_path"
  "${import_cmd[@]}"
fi

loaded_models_json="$("$lms_bin" ps --json)"
if grep -F "\"identifier\":\"$identifier\"" <<<"$loaded_models_json" >/dev/null; then
  printf 'Unloading existing model instance: %s\n' "$identifier"
  "$lms_bin" unload "$identifier"
fi

load_cmd=(
  "$lms_bin" load --exact "$exact_model_path"
  --identifier "$identifier"
  --gpu "$gpu_ratio"
  --context-length "$context_length"
  --parallel "$parallel_count"
  -y
)

printf 'Loading model with context=%s parallel=%s gpu=%s\n' \
  "$context_length" "$parallel_count" "$gpu_ratio"
"${load_cmd[@]}"

server_status_json="$("$lms_bin" server status --json 2>/dev/null || true)"
server_running="$(json_field "$server_status_json" "running")"
current_port="$(json_field "$server_status_json" "port")"

if [[ "$server_running" == "true" ]]; then
  if [[ "$current_port" == "$backend_port" ]]; then
    printf 'LM Studio backend is already running on port %s\n' "$current_port"
  else
    printf 'Restarting LM Studio backend from port %s to %s\n' \
      "${current_port:-unknown}" "$backend_port"
    "$lms_bin" server stop
    server_running="false"
  fi
fi

if [[ "$server_running" != "true" ]]; then
  start_cmd=("$lms_bin" server start --bind "$backend_bind" --port "$backend_port")
  printf 'Starting LM Studio backend on %s:%s\n' "$backend_bind" "$backend_port"
  "${start_cmd[@]}"
fi

wait_for_url "http://127.0.0.1:${backend_port}/v1/models" "LM Studio backend"

if [[ "$disable_thinking" != "0" ]]; then
  start_no_think_proxy "$bind_address" "$public_port" "$backend_port"
  printf 'Reasoning mode: OFF\n'
  printf 'No-think proxy is serving on %s:%s -> 127.0.0.1:%s\n' \
    "$bind_address" "$public_port" "$backend_port"
else
  stop_pid_file "$(resolve_lmstudio_home)/.internal/lms-no-think-proxy.pid" "no-think proxy"
  printf 'Reasoning mode: ON\n'
fi

cat <<EOF
Ready.
OpenAI-compatible endpoint: http://localhost:${public_port}/v1
Loaded model identifier: ${identifier}

Example request:
curl http://localhost:${public_port}/v1/chat/completions \\
  -H "Content-Type: application/json" \\
  -d '{
    "model": "${identifier}",
    "messages": [{"role":"user","content":"Hello"}],
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "presence_penalty": 1.5
  }'
EOF

With this, the lms inference server is up and running.
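The reasoning-off mode of the script relies on a small trick: the proxy appends an assistant message containing an empty think block, so the model continues as if it had already finished thinking. The following is a minimal standalone sketch of that transformation, modeled on ensure_no_think in the script but returning a copy instead of modifying the payload in place. The name with_no_think is my own.

```python
import copy

# Same prefill text the proxy in this article injects.
NO_THINK_PREFILL = {"role": "assistant", "content": "<think>\n\n</think>\n\n"}


def with_no_think(payload: dict) -> dict:
    """Return a copy of a chat payload with an empty <think> block prefilled."""
    result = copy.deepcopy(payload)
    messages = result.get("messages")
    if not isinstance(messages, list):
        # Not a chat-style body (no "messages" list): leave it unchanged.
        return result
    last = messages[-1] if messages else None
    if isinstance(last, dict) and last.get("role") == "assistant":
        # The caller already supplied an assistant prefill; do not add another.
        return result
    messages.append(dict(NO_THINK_PREFILL))
    return result


if __name__ == "__main__":
    chat = {"model": "Qwen3.6-27B-UD-Q4_K_XL",
            "messages": [{"role": "user", "content": "Hello"}]}
    print(with_no_think(chat)["messages"][-1])

    # A Responses-style body uses "input", so it passes through untouched.
    responses_body = {"model": "Qwen3.6-27B-UD-Q4_K_XL", "input": "Hello"}
    print(with_no_think(responses_body) == responses_body)
```

One consequence worth noting: the script's proxy only rewrites POST requests to /v1/chat/completions, so by my reading of the code, requests sent to other paths are forwarded as-is.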

  • Unloading a model

To unload all loaded models, run the following command:

$ lms unload --all

Calling the lms server's local LLM from Codex CLI

Here is how to call the local LLM API described above from the Codex CLI.

  • Codex installation
$ npm i -g @openai/codex@latest
  • Editing the config file

Add the following to ~/.codex/config.toml:

$ code ~/.codex/config.toml
~/.codex/config.toml
[model_providers.lms]
name = "lms API"
base_url = "http://localhost:8080/v1"
wire_api = "responses"
stream_idle_timeout_ms = 10000000

Setting wire_api = "responses" makes Codex call the LM Studio endpoint through the Responses API.

Also, the model name must match the identifier used when loading the model in lms. In this example it was loaded as Qwen3.6-27B-UD-Q4_K_XL, so specify the same name on the Codex side.
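To check the endpoint independently of Codex, a Responses-style request can be built by hand. The sketch below assumes the base_url from the config above and the common Responses request shape (a model plus an input field); the function names are mine. The demo only constructs and prints the request; send is defined but not called, since it needs the lms server from this article to be running.

```python
import json
from urllib.request import Request, urlopen


def build_responses_request(base_url: str, model: str, prompt: str) -> Request:
    """Build a POST to <base_url>/responses for the given model identifier."""
    body = json.dumps({"model": model, "input": prompt}).encode("utf-8")
    return Request(
        f"{base_url.rstrip('/')}/responses",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON reply (requires a running server)."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_responses_request(
        "http://localhost:8080/v1", "Qwen3.6-27B-UD-Q4_K_XL", "Hello"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

If the model field does not match the identifier passed to lms load, the server may not route the request to the loaded model, which is the failure mode the paragraph above warns about.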

  • Execution
$ codex --model Qwen3.6-27B-UD-Q4_K_XL -c model_provider=lms --search --dangerously-bypass-approvals-and-sandbox

With this, you can now drive Codex with a local LLM.

https://x.com/gosrum/status/2053327293825839550

Thoughts after trying it out

After actually running it, I confirmed that the Codex harness itself works without issues even with a local LLM.

On the other hand, practicality depends heavily on model performance and speed. Because Codex repeatedly makes tool calls and checks diffs, waiting times become quite long with models that are slow at inference.
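A rough back-of-the-envelope model shows why the waiting time adds up. The sketch below assumes each agent turn re-reads its prompt at a prefill speed and then generates output at a decode speed, and it ignores prompt caching and tool execution time. Every number in the example is made up for illustration and is not a measurement from this setup.

```python
def estimate_wait_seconds(turns, prompt_tokens, output_tokens,
                          prefill_tps, decode_tps):
    """Estimate total model time for an agent run, in seconds.

    Each turn costs prompt_tokens / prefill_tps (reading the context)
    plus output_tokens / decode_tps (generating the reply or tool call).
    """
    per_turn = prompt_tokens / prefill_tps + output_tokens / decode_tps
    return turns * per_turn


if __name__ == "__main__":
    # Hypothetical run: 30 turns, 20,000 prompt tokens and 300 output tokens
    # per turn, 2,000 tokens/s prefill and 30 tokens/s decode.
    total = estimate_wait_seconds(30, 20000, 300, 2000, 30)
    print(f"{total / 60:.1f} min")  # prints "10.0 min"
```

Under these assumed numbers, halving the decode speed adds five minutes to the run, which illustrates why per-token speed matters more for agents than for one-shot chat.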

Also, for use as a coding agent, what matters is not just simple chat performance but instruction following, tool use, and stability over long contexts. Note that the fact that it runs does not mean you immediately get the same experience as with a commercial model.

That said, being able to run Codex with a local LLM makes it easier to compare agent performance between models on the same harness. Personally, I feel this is the biggest benefit.

Summary

In this article, I summarized the methods for driving the Codex CLI with a local LLM, including:

  • Results from trying it with llama.cpp
  • How to load GGUF models using the LM Studio CLI
  • How to set up an LM Studio OpenAI-compatible endpoint
  • Configuration of ~/.codex/config.toml on the Codex side
  • How to call a local LLM from the Codex CLI

At this point, llama.cpp alone did not work, but I confirmed that a local LLM can be called from Codex through LM Studio.

Going forward, I would like to run benchmark evaluations of Codex combined with local LLMs.

Thank you for reading to the end. I will continue to share interesting uses and handy tricks on X and in articles as I find them.
