How to Use llama.cpp to Run GGUF Language Models in iOS Apps
Introduction
If you want to integrate LLMs into macOS or iOS apps, a well-known approach is to load models with llama.cpp, which uses the GGUF file format as its standard model format.
When new open language models are released, quantized versions in GGUF format are often published on Hugging Face by official sources or community members, making them easily downloadable for anyone.
Recently, the Japanese-capable language model "TinySwallow-1.5B," co-developed by Sakana AI and the Swallow team at the Institute of Science Tokyo, demonstrated high response quality despite having only 1.5B parameters, which made on-device integration feel quite realistic to me. In fact, TinySwallow-1.5B is only about 1.6GB (for comparison, Llama-3-ELYZA-JP-8B is nearly 5GB). Running such relatively compact SLMs (Small Language Models) offline inside an iPhone app can be expected to enhance AI features.
For example, I personally want to process tasks such as Japanese translation and summarization offline within an app. It could also be used to enhance proofreading features in editors.
Therefore, I investigated methods for calling GGUF format models from iOS apps. I will introduce three patterns, all of which share the same basic approach:
- Integrate llama.cpp as a build target in your app using SwiftPM
- Write a wrapper to call the C API from Swift
- Use it from SwiftUI
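As a rough illustration of the second step, a minimal Swift wrapper over the llama.cpp C API might look like the following. This is a sketch only: `llama_model_default_params`, `llama_model_load_from_file`, and `llama_model_free` are real llama.cpp C functions (at recent HEAD), the `llama` module name matches the product exposed by llama.cpp's Package.swift, and the `LlamaModel` class name is hypothetical.

```swift
import Foundation
import llama  // SwiftPM product exposing the llama.cpp C API

// Hypothetical minimal wrapper: owns the C model handle and frees it on deinit.
final class LlamaModel {
    private let model: OpaquePointer

    init(path: String) throws {
        llama_backend_init()
        var params = llama_model_default_params()
        #if targetEnvironment(simulator)
        params.n_gpu_layers = 0  // the iOS Simulator has no Metal support
        #endif
        guard let model = llama_model_load_from_file(path, params) else {
            throw NSError(domain: "LlamaModel", code: 1)
        }
        self.model = model
    }

    deinit {
        llama_model_free(model)
        llama_backend_free()
    }
}
```

The three patterns below differ mainly in how much of this wrapper layer you write yourself versus reuse.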
1. Standard Approach: examples/llama.swiftui
The llama.cpp repository includes a SwiftUI app sample.

You can run it on an actual device by copying any model into the models directory and changing the file name that is loaded from the Bundle.
private var defaultModelUrl: URL? {
    Bundle.main.url(forResource: "tinyswallow-1.5b-instruct-q8_0", withExtension: "gguf", subdirectory: "models")
    // Bundle.main.url(forResource: "llama-2-7b-chat", withExtension: "Q2_K.gguf", subdirectory: "models")
}
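Loading the bundled model could then look like the following. This assumes the `LlamaContext.create_context(path:)` factory defined in the sample's LibLlama.swift; hedged here in case the sample's API has changed.

```swift
// Sketch: loading the bundled model via the sample's LlamaContext.
guard let url = defaultModelUrl else {
    fatalError("model not bundled in models/ directory")
}
let llamaContext = try LlamaContext.create_context(path: url.path)
```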
However, at the current HEAD (cfd74c86dbaa95ed30aa6b30e14d8801eb975d63), the build fails. It has been broken since #11110, merged on 2025/01/06.
/Library/Developer/Xcode/DerivedData/llama.swiftui-ckrcdzqhkwhfhnbyreolkfyeszfk/SourcePackages/checkouts/llama.cpp/Sources/llama/llama.h:3:10 'llama.h' file not found with <angled> include; use "quotes" instead
An easy workaround is to roll back the llama.cpp package dependency to a version that still built successfully and rewrite a few call sites to use the old APIs.
First, remove the local path reference from the Xcode Swift Package settings.
After that, configure it by specifying the repository URL and the commit ID 6dfcfef0787e9902df29f510b63621f60a09a50b.
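If you manage the dependency in a Package.swift instead of the Xcode UI, the same pin can be expressed with SwiftPM's revision-based requirement. The package and target names below are illustrative; the `llama` product name matches what llama.cpp's Package.swift exported at that commit.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyLlamaApp",
    dependencies: [
        // Pin llama.cpp to the last commit known to build for iOS.
        .package(
            url: "https://github.com/ggerganov/llama.cpp",
            revision: "6dfcfef0787e9902df29f510b63621f60a09a50b"
        )
    ],
    targets: [
        .target(
            name: "MyLlamaApp",
            dependencies: [.product(name: "llama", package: "llama.cpp")]
        )
    ]
)
```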

Revert the calls to the new APIs added in #11110 back to the old ones. Only three places are affected, so you can do it by hand.
diff --git a/examples/llama.swiftui/llama.cpp.swift/LibLlama.swift b/examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
index 998c673d5d31f..477c3e6f2e95b 100644
--- a/examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
+++ b/examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
@@ -52,8 +52,8 @@ actor LlamaContext {
     deinit {
         llama_sampler_free(sampling)
         llama_batch_free(batch)
-        llama_model_free(model)
         llama_free(context)
+        llama_free_model(model)
         llama_backend_free()
     }
@@ -65,7 +65,7 @@ actor LlamaContext {
         model_params.n_gpu_layers = 0
         print("Running on simulator, force use n_gpu_layers = 0")
 #endif
-        let model = llama_model_load_from_file(path, model_params)
+        let model = llama_load_model_from_file(path, model_params)
         guard let model else {
             print("Could not load model at \(path)")
             throw LlamaError.couldNotInitializeContext
@@ -151,7 +151,7 @@ actor LlamaContext {
         new_token_id = llama_sampler_sample(sampling, context, batch.n_tokens - 1)
-        if llama_vocab_is_eog(model, new_token_id) || n_cur == n_len {
+        if llama_token_is_eog(model, new_token_id) || n_cur == n_len {
             print("\n")
             is_done = true
             let new_token_str = String(cString: temporary_invalid_cchars + [0])
As you can see, because llama.cpp development moves quickly at the bleeding edge, builds for iOS, which have lower priority, often break.
Here are some helpful resources for finding a working version:
- Change history of the cmake directory
- Change history of examples/llama.swiftui
Monitoring .github/workflows/build.yml is also effective, as the CI runs builds including xcodebuild.
2. Starter Kit: MiniAIChat App
The MiniAIChat app published by giginet incorporates llama.cpp (as of November 2024) via SwiftPM, and it will likely build as-is in your environment.
By replacing Llama-3-ELYZA-JP-8B-q4_k_m in Configurations.swift with any GGUF file, copying it into the models/ directory, and deploying to an actual device, you get a user-friendly chat app.
As for the Swift wrapper class: while the official llama.swiftui app shares its code with a library called llama-cpp-swift, the MiniAIChat version exposes more detailed llama.cpp settings and offers a Swift-like implementation built on AsyncSequence, making it an ideal base for a chat app.
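The AsyncSequence approach can be sketched roughly as follows. The `generate` function and `sampleNextToken` closure are hypothetical stand-ins for MiniAIChat's actual sampling loop, which differs in detail.

```swift
import Foundation

// Sketch: exposing token generation as an AsyncThrowingStream,
// so SwiftUI can consume tokens with `for try await`.
// `sampleNextToken` stands in for the llama.cpp sampling loop;
// it returns nil when generation is done.
func generate(prompt: String,
              sampleNextToken: @escaping @Sendable () -> String?) -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        Task.detached {
            while let token = sampleNextToken() {
                continuation.yield(token)
            }
            continuation.finish()
        }
    }
}
```

On the consuming side, a SwiftUI view model would append each yielded token to the displayed text inside a `for try await token in generate(...)` loop, giving the familiar streaming-chat effect.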
To launch on iOS Simulator
Since the iOS Simulator does not support Metal, you must explicitly turn it off.
convenience init(modelPath: URL, params: Parameters) throws {
    var modelParams = llama_model_default_params()
#if targetEnvironment(simulator)
    modelParams.n_gpu_layers = 0
    print("Running on simulator, force use n_gpu_layers = 0")
#endif
3. For Reference: LLMFarm_core.swift
LLMFarm_core.swift is the core library of the LLMFarm app, which was also used in the previously mentioned TinySwallow-1.5B demo.
This library can be installed and used via SwiftPM. It comes with a demo project.
However, the llama.cpp it calls internally is a custom fork.
guinmoon/llama.cpp
This project is best used as a reference implementation of an app running in production, rather than as a base to build on right away.