Running local models through the Ollama API

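Published on 2025/09/23
Ollama official site

https://ollama.com/
Ollama API specification

https://github.com/ollama/ollama/blob/main/docs/api.md
Another copy of the API spec appears to exist elsewhere, but it does not reflect the latest version, so referring to the GitHub repository is recommended.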

Basic endpoints

If the ollama command can be run, the REST API is available as well.

POST /api/pull

https://ollama.readthedocs.io/en/api/#pull-a-model
Download a model from the ollama library. Cancelled pulls are resumed from where they left off, and multiple calls will share the same download progress.
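Downloads a model supported by the Ollama library to the server.

Behaves roughly like the following command.
ollama pull llama3.2
With the ollama command, run also takes care of the pull, but to use a model through the REST API it has to be pulled beforehand.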
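
For reference, the same pull can be issued from a script. The following is a minimal Python sketch using only the standard library. It assumes the server listens on http://localhost:11434 as in the curl examples in this article, and that the request body takes a "model" field and the reply is one JSON status object per line, as described in the linked API spec. The helper names build_pull_request and pull_model are made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local server address


def build_pull_request(model: str, base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /api/pull request for the given model."""
    body = json.dumps({"model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/pull",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def pull_model(model: str) -> None:
    """Send the pull request and print each streamed status object."""
    with urllib.request.urlopen(build_pull_request(model)) as resp:
        for raw_line in resp:  # assumed: one JSON object per line
            if raw_line.strip():
                print(json.loads(raw_line).get("status"))
```

Calling pull_model("llama3.2") against a running server should print the progress statuses until the download finishes.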

POST /api/generate

https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion
Generate a response for a given prompt with a provided model. This is a streaming endpoint, so there will be a series of responses. The final response object will include statistics and additional data from the request.
Runs the model on the Ollama server and returns the output.

Behaves roughly like the following command.
ollama run llama3.2 "Why is the sky blue?"
Send a request that specifies the model and the prompt.
# Request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?"
}'
Running this request streams the model output back a small fragment at a time, as a series of JSON response objects.
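
To turn the streamed fragments back into a single string, something like the following Python sketch could work. It assumes each line of the stream is one JSON object with a "response" fragment and a "done" flag, as described in the linked API spec. The function name join_streamed_response and the sample chunks are hand-written for illustration, not real server output.

```python
import json
from typing import Iterable


def join_streamed_response(lines: Iterable[bytes]) -> str:
    """Concatenate the "response" fragments of a streamed /api/generate reply."""
    parts = []
    for raw_line in lines:
        if not raw_line.strip():
            continue  # skip blank lines
        chunk = json.loads(raw_line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object carries statistics, not more text
    return "".join(parts)


# Hand-written sample chunks (not real server output).
sample = [
    b'{"model": "llama3.2", "response": "The", "done": false}\n',
    b'{"model": "llama3.2", "response": " sky", "done": false}\n',
    b'{"model": "llama3.2", "response": "", "done": true}\n',
]
print(join_streamed_response(sample))  # prints "The sky"
```

With a real server, the object returned by urllib.request.urlopen can be passed in directly, since it iterates line by line.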

To get the whole result in a single response, set the stream parameter to false.
Various other optional parameters can be set as well.
# Request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": {
    "num_keep": 5,
    "seed": 42,
    "num_predict": 100,
    "top_k": 20,
    "top_p": 0.9,
    "min_p": 0.0,
    "typical_p": 0.7,
    "repeat_last_n": 33,
    "temperature": 0.8,
    "repeat_penalty": 1.2,
    "presence_penalty": 1.5,
    "frequency_penalty": 1.0,
    "penalize_newline": true,
    "stop": ["\n", "user:"],
    "numa": false,
    "num_ctx": 1024,
    "num_batch": 2,
    "num_gpu": 1,
    "main_gpu": 0,
    "use_mmap": true,
    "num_thread": 8
  }
}'
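
The non-streaming request can also be sent from Python. The sketch below is only one way to do it, under the assumption that the server runs on http://localhost:11434 and that the response contains the eval_count and eval_duration (nanoseconds) statistics described in the linked API spec. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def build_generate_request(model: str, prompt: str, options: Optional[dict] = None,
                           base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming POST /api/generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if options:
        payload["options"] = options  # e.g. {"seed": 42, "temperature": 0.8}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_once(model: str, prompt: str, options: Optional[dict] = None) -> dict:
    """Send the request and return the single JSON response object."""
    with urllib.request.urlopen(build_generate_request(model, prompt, options)) as resp:
        return json.load(resp)


def tokens_per_second(final: dict) -> float:
    """Generation speed; eval_duration is assumed to be in nanoseconds."""
    return final["eval_count"] * 1e9 / final["eval_duration"]
```

For example, result = generate_once("llama3.2", "Why is the sky blue?", {"seed": 42}) followed by tokens_per_second(result) gives a rough throughput figure for comparing models.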

POST /api/chat

https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-chat-completion
Generate the next message in a chat with a provided model. This is a streaming endpoint, so there will be a series of responses. Streaming can be disabled using "stream": false. The final response object will include statistics and additional data from the request.
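An endpoint specialized for chat.

Unlike POST /api/generate, the messages are structured as a list, so the conversation history can be sent in one request. A role can also be attached to each message.

There is probably no directly corresponding ollama command...?
As a rule of thumb, POST /api/generate seems suited to one-off prompts where you just want the output (for example, LLM performance evaluation), and POST /api/chat to using the LLM as a chatbot.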

Request
# Request
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    },
    {
      "role": "assistant",
      "content": "due to rayleigh scattering."
    },
    {
      "role": "user",
      "content": "how is that different than mie scattering?"
    }
  ]
}'
// Response
{
  "model": "llama3.2",
  "created_at": "2023-08-04T08:52:19.385406455-07:00",
  "message": {
    "role": "assistant",
    "content": "The"
  },
  "done": false
}
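
Because the streamed chat response arrives in pieces like the object above, the pieces need to be joined before the reply can be added to the conversation history. The following Python sketch shows one way to do that. It assumes each streamed line is a JSON object with a "message" and a "done" flag, as in the response above. The function name assemble_chat_reply and the sample chunks are made up for illustration.

```python
import json
from typing import Dict, Iterable


def assemble_chat_reply(lines: Iterable[bytes]) -> Dict[str, str]:
    """Join streamed /api/chat chunks into one message with role and content."""
    role = "assistant"
    parts = []
    for raw_line in lines:
        if not raw_line.strip():
            continue  # skip blank lines
        chunk = json.loads(raw_line)
        message = chunk.get("message", {})
        role = message.get("role", role)
        parts.append(message.get("content", ""))
        if chunk.get("done"):
            break
    return {"role": role, "content": "".join(parts)}


# Conversation so far, in the list format that /api/chat expects.
history = [{"role": "user", "content": "why is the sky blue?"}]

# Hand-written sample chunks (not real server output).
sample = [
    b'{"message": {"role": "assistant", "content": "due to"}, "done": false}',
    b'{"message": {"role": "assistant", "content": " rayleigh scattering."}, "done": false}',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}',
]

# Append the assistant reply, then the next user turn, ready for the next request.
history.append(assemble_chat_reply(sample))
history.append({"role": "user", "content": "how is that different than mie scattering?"})
```

The resulting history list can be sent as the "messages" field of the next request, which is how the model sees the earlier turns.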

POST /api/create

https://github.com/ollama/ollama/blob/main/docs/api.md#create-a-model
If you are creating a model from a safetensors directory or from a GGUF file, you must create a blob for each of the files and then use the file name and SHA256 digest associated with each blob in the files field.
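To use model data in safetensors or GGUF format with Ollama, register the model through this endpoint. It is roughly equivalent to the usual flow of writing a Modelfile and loading it with the ollama create command to make a custom model available in Ollama.
# Customize the system prompt of an existing model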
curl http://localhost:11434/api/create -d '{
  "model": "mario",
  "from": "llama3.2",
  "system": "You are Mario from Super Mario Bros."
}'

# Quantize an existing model
curl http://localhost:11434/api/create -d '{
  "model": "llama3.2:quantized",
  "from": "llama3.2:3b-instruct-fp16",
  "quantize": "q4_K_M"
}'

# Register a custom GGUF file
curl http://localhost:11434/api/create -d '{
  "model": "my-gguf-model",
  "files": {
    "test.gguf": "sha256:432f310a77f4650a88d0fd59ecdd7cebed8d684bafea53cbff0473542964f0c3"
  }
}'
When quantizing with Ollama, only the following three types are supported.
q4_K_M
q4_K_S
q8_0
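
Since only these three types are accepted, a script that builds the create request could check the value up front. This is just a small Python sketch; the constant and function names are hypothetical, and the payload fields mirror the quantization curl example above.

```python
# The three quantization types listed above.
SUPPORTED_QUANTIZATIONS = ("q4_K_M", "q4_K_S", "q8_0")


def build_quantize_payload(model: str, source: str, quantize: str) -> dict:
    """Build the JSON body for POST /api/create that quantizes an existing model."""
    if quantize not in SUPPORTED_QUANTIZATIONS:
        raise ValueError(
            f"unsupported quantization {quantize!r}; "
            f"expected one of {', '.join(SUPPORTED_QUANTIZATIONS)}"
        )
    return {"model": model, "from": source, "quantize": quantize}


payload = build_quantize_payload("llama3.2:quantized", "llama3.2:3b-instruct-fp16", "q4_K_M")
```

Failing early on the client side avoids sending a request that the server would reject anyway.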
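To register a custom GGUF file, the file has to be placed under the Ollama server beforehand.

It can be placed by sending a file upload request to the POST /api/blobs/:digest endpoint, but the create request also works if the file is put directly into a specific directory without going through the endpoint.
The corresponding directories are: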
Linux: /usr/share/ollama/.ollama/models/blobs, /var/snap/ollama/common/models/blobs
macOS: ~/.ollama/models
Windows: C:\Users\{username}\.ollama\models

POST /api/blobs/:digest

https://github.com/ollama/ollama/blob/main/docs/api.md#push-a-blob
Push a file to the Ollama server to create a "blob" (Binary Large Object).
When registering a model with /api/create, this endpoint can be used to place the model file on the server beforehand.

The path parameter includes the SHA256 hash computed from the GGUF file.
# Request
curl -T model.gguf -X POST http://localhost:11434/api/blobs/sha256:29fdb92e57cf0827ded04ae6461b5931d01fa595843f55d36f5b275a52087dd2
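
The digest in the path has to be computed from the file itself. A minimal Python sketch of that step, using hashlib from the standard library, might look like this. The helper names are made up for illustration, and the base URL assumes the same local server as the curl examples.

```python
import hashlib


def sha256_digest(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the "sha256:<hex>" digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large model files do not have to fit in memory.
        for block in iter(lambda: f.read(chunk_size), b""):
            hasher.update(block)
    return "sha256:" + hasher.hexdigest()


def blob_url(digest: str, base_url: str = "http://localhost:11434") -> str:
    """Build the POST /api/blobs/:digest URL for an already computed digest."""
    return f"{base_url}/api/blobs/{digest}"


def files_field(filename: str, digest: str) -> dict:
    """Build the "files" field used by POST /api/create for one file."""
    return {filename: digest}
```

For example, digest = sha256_digest("model.gguf") gives the value to use both in blob_url(digest) for the upload and in files_field("model.gguf", digest) for the later create request.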