
How Ollama Invokes Local LLMs

Ollama and llama.cpp

Ollama is known to use llama.cpp internally for LLM text generation. By using quantized GGML/GGUF models built for llama.cpp, it can run LLMs on the processing power of a typical user's PC, whereas model inference would otherwise require high-end GPUs and large amounts of RAM.

I had only a vague understanding up to that point, but when I wanted to use my own custom models, I needed to learn more about llama.cpp. So I decided to investigate the internal workings, starting from the source code of the ollama command.

https://github.com/ollama/ollama

ollama server and llama server

Ollama provides a REST API as a backend service implemented in Go. This is referred to as the "ollama server." The ollama command communicates with this backend service.

The ollama server is a web server that acts as a wrapper for calling the llama.cpp library. For example, it loads and sets up models that the user has pulled (downloaded) inside the server.

The ollama server also launches a program called ollama_llama_server, which wraps the web server functionality built into llama.cpp. This feature, provided by llama.cpp, is known as the "llama server."

The component that actually performs text generation is this llama server, implemented in C++; the ollama server receives its results and performs post-processing. This is how the ollama command enables chat-style interactions.

https://github.com/ollama/ollama/blob/105186aa179c7ccbac03d6719ab1c58ab87d6477/llm/server.go#L264-L291

Ollama and CGo

In addition to calling via the llama server, Ollama uses CGo to call functions from the llama.cpp library. For example, the ollama create command, used to create custom models in Ollama, calls the llama_model_quantize() function in llama.cpp to execute model quantization.

https://github.com/ollama/ollama/blob/105186aa179c7ccbac03d6719ab1c58ab87d6477/llm/llm.go#L23-L39

What is being downloaded with ollama pull?

When you execute a command like ollama pull mistral, Ollama consults its registry to download the model. By default, the binary files are hosted on, and downloaded from, registry.ollama.ai.

For example, for https://ollama.com/library/llama3:latest, registry.ollama.ai/library/llama3:latest is referenced, and the downloaded files are saved in pieces under ~/.ollama.

.ollama/models/manifests/registry.ollama.ai/library/llama3
.ollama/models/blobs/sha256-109037bec39c0becc8221222ae23557559bc594290945a2c4221ab4f303b8871
.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29
.ollama/models/blobs/sha256-8ab4849b038cf0abc5b1c9b8ee1443dca6b93a045c2272180d985126eb40bf6f

This registry is also used when users upload models created locally using ollama push.

Note, therefore, that these models may differ in version from the ones hosted on Hugging Face.
