
How to Use llama.cpp to Run GGUF Language Models in iOS Apps


Introduction

If you want to integrate LLMs into macOS or iOS apps, a well-known approach is to use llama.cpp to load models in the GGUF file format. GGUF is llama.cpp's standard format for model files.

When new open language models are released, quantized versions in GGUF format are often published on Hugging Face by official sources or community members, making them easily downloadable for anyone.

https://llm-jp.github.io/awesome-japanese-llm/
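As a concrete example, a quantized GGUF can be fetched with the Hugging Face CLI. The repository and file names below follow the TinySwallow release naming but are shown as an assumption; verify them on the actual model page.

```shell
# Sketch: download a quantized GGUF from Hugging Face
# (repo/file names are examples; check the model page)
pip install -U "huggingface_hub[cli]"
huggingface-cli download SakanaAI/TinySwallow-1.5B-Instruct-GGUF \
  tinyswallow-1.5b-instruct-q8_0.gguf --local-dir ./models
```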

Recently, "TinySwallow-1.5B," a Japanese-capable language model co-developed by Sakana AI and the Swallow team at the Institute of Science Tokyo, demonstrated high response quality despite having only 1.5B parameters, and made on-device integration feel quite realistic. In fact, TinySwallow-1.5B is only about 1.6GB (for reference, Llama-3-ELYZA-JP-8B is nearly 5GB). Running such relatively compact SLMs (Small Language Models) offline inside an iPhone app is a promising way to strengthen AI features.

For example, I personally want to process tasks such as Japanese translation and summarization offline within an app. It could also be used to enhance proofreading features in editors.

Therefore, I investigated methods for calling GGUF format models from iOS apps. I will introduce three patterns, all of which share the same basic approach:

  1. Integrate llama.cpp as a build target in your app using SwiftPM
  2. Write a wrapper to call the C API from Swift
  3. Use it from SwiftUI
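As a rough sketch of steps 1 and 2, a minimal Swift wrapper over the C API might look like the following. The API names match those used later in this article; the module name "llama" and the LlamaError type are assumptions for illustration.

```swift
import llama  // module exposed by the llama.cpp Swift package (name may vary)

enum LlamaError: Error { case couldNotLoadModel }

// Minimal sketch: own a model handle and release it deterministically
final class LlamaModel {
    private let model: OpaquePointer

    init(path: String) throws {
        llama_backend_init()
        let params = llama_model_default_params()
        guard let m = llama_model_load_from_file(path, params) else {
            llama_backend_free()
            throw LlamaError.couldNotLoadModel
        }
        model = m
    }

    deinit {
        llama_model_free(model)
        llama_backend_free()
    }
}
```

A real wrapper would also create a context, batch, and sampler, as in the LibLlama.swift code shown below.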

1. Standard Approach: examples/llama.swiftui

The llama.cpp repository includes a SwiftUI app sample.

https://github.com/ggerganov/llama.cpp/tree/master/examples/llama.swiftui

To run it on an actual device, copy any model into the models directory and change the resource name used when loading it from the Bundle.

    private var defaultModelUrl: URL? {
        Bundle.main.url(forResource: "tinyswallow-1.5b-instruct-q8_0", withExtension: "gguf", subdirectory: "models")
        // Bundle.main.url(forResource: "llama-2-7b-chat", withExtension: "Q2_K.gguf", subdirectory: "models")
    }

However, at the current HEAD cfd74c86dbaa95ed30aa6b30e14d8801eb975d63, the build fails, and it has been broken since #11110 was merged on 2025/01/06.

/Library/Developer/Xcode/DerivedData/llama.swiftui-ckrcdzqhkwhfhnbyreolkfyeszfk/SourcePackages/checkouts/llama.cpp/Sources/llama/llama.h:3:10 'llama.h' file not found with <angled> include; use "quotes" instead

An easy workaround is to roll back the dependent llama.cpp package to a version where the build was successful and rewrite some parts to call the old APIs.

First, remove the local path reference from the Xcode Swift Package settings.
After that, configure it by specifying the repository URL and the commit ID 6dfcfef0787e9902df29f510b63621f60a09a50b.
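If you manage the dependency in a Package.swift manifest instead of the Xcode UI, the equivalent pin would look something like this (the target and product names here are assumptions; adjust them to your project):

```swift
// Package.swift (sketch): pin llama.cpp to a known-good commit
dependencies: [
    .package(
        url: "https://github.com/ggerganov/llama.cpp",
        revision: "6dfcfef0787e9902df29f510b63621f60a09a50b"
    )
],
targets: [
    .executableTarget(
        name: "MyApp",
        dependencies: [.product(name: "llama", package: "llama.cpp")]
    )
]
```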

Revert the calls to the new APIs added in #11110. The diff below is the change from #11110 itself, so apply it in reverse (i.e., restore the `-` lines). There are only three places, so you can do it manually.

diff --git a/examples/llama.swiftui/llama.cpp.swift/LibLlama.swift b/examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
index 998c673d5d31f..477c3e6f2e95b 100644
--- a/examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
+++ b/examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
@@ -52,8 +52,8 @@ actor LlamaContext {
     deinit {
         llama_sampler_free(sampling)
         llama_batch_free(batch)
+        llama_model_free(model)
         llama_free(context)
-        llama_free_model(model)
         llama_backend_free()
     }
 
@@ -65,7 +65,7 @@ actor LlamaContext {
         model_params.n_gpu_layers = 0
         print("Running on simulator, force use n_gpu_layers = 0")
 #endif
-        let model = llama_load_model_from_file(path, model_params)
+        let model = llama_model_load_from_file(path, model_params)
         guard let model else {
             print("Could not load model at \(path)")
             throw LlamaError.couldNotInitializeContext
@@ -151,7 +151,7 @@ actor LlamaContext {
 
         new_token_id = llama_sampler_sample(sampling, context, batch.n_tokens - 1)
 
-        if llama_token_is_eog(model, new_token_id) || n_cur == n_len {
+        if llama_vocab_is_eog(model, new_token_id) || n_cur == n_len {
             print("\n")
             is_done = true
             let new_token_str = String(cString: temporary_invalid_cchars + [0])

As you can see, llama.cpp development moves fast at the bleeding edge, so lower-priority targets such as the iOS build break frequently.

Here are some helpful resources for finding a working version:

  • Change history of the cmake directory
  • Change history of examples/llama.swiftui

Monitoring .github/workflows/build.yml is also effective, as the CI runs builds including xcodebuild.
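For example, in a local clone you can skim the relevant change histories, and optionally check recent CI results with the GitHub CLI (the `gh` invocation assumes you have it installed and authenticated):

```shell
# Commits that touched iOS-relevant paths
git log --oneline -- cmake/ examples/llama.swiftui/
# Recent runs of the build workflow, including the xcodebuild jobs
gh run list --workflow build.yml --limit 10
```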

2. Starter Kit: MiniAIChat App

The MiniAIChat app published by giginet incorporates a November 2024 version of llama.cpp via SwiftPM, so it should build as-is in your environment.
https://github.com/giginet/MiniAIChat
https://giginet.hateblo.jp/entry/2024/12/09/110000
Replace Llama-3-ELYZA-JP-8B-q4_k_m in Configurations.swift with the name of any GGUF file, copy that file into the models/ directory, and deploy to an actual device to get a user-friendly chat app running.
As for the Swift wrapper class, the official llama.swiftui app reuses code from a library called llama-cpp-swift, whereas the MiniAIChat version exposes more detailed llama.cpp settings and offers a more Swift-like implementation built on AsyncSequence, making it an ideal base for a chat app.
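To illustrate the AsyncSequence style, here is a simplified sketch (not MiniAIChat's actual code; the llama.cpp sampling loop is replaced by a placeholder):

```swift
// Sketch: expose token generation as an AsyncStream of text pieces
func generate(prompt: String) -> AsyncStream<String> {
    AsyncStream { continuation in
        Task {
            // In a real implementation: tokenize `prompt`, then call
            // llama_decode / llama_sampler_sample in a loop until EOG.
            for piece in ["This ", "is ", "a ", "placeholder."] {
                continuation.yield(piece)
            }
            continuation.finish()
        }
    }
}

// Usage from SwiftUI:
// for await piece in generate(prompt: userInput) { messageText += piece }
```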

To launch on iOS Simulator

Since the iOS Simulator does not support Metal, you must explicitly disable GPU offloading by setting n_gpu_layers to 0.

    convenience init(modelPath: URL, params: Parameters) throws {
        var modelParams = llama_model_default_params()
#if targetEnvironment(simulator)
        modelParams.n_gpu_layers = 0
        print("Running on simulator, force use n_gpu_layers = 0")
#endif
        // ... (rest of the initializer unchanged)
    }

3. For Reference: LLMFarm_core.swift

LLMFarm_core.swift is the core library of the LLMFarm app, which was also used in the previously mentioned TinySwallow-1.5B demo.
https://github.com/guinmoon/llmfarm_core.swift
This library can be installed and used via SwiftPM. It comes with a demo project.
However, the llama.cpp it calls internally is a custom fork:
guinmoon/llama.cpp
This project is best used for referencing the implementation of an app running in a production environment rather than incorporating it as a base right away.

References

https://zenn.dev/shu223/articles/localllm-ios
https://zenn.dev/turing_motors/articles/59c829daaa3307
