Trying embeddings with llama.cpp and ollama

kun432

llama.cpp has apparently supported embeddings for a while, but until now almost no multilingual embedding models worked with it.

That seems to have been addressed by the following.

Judging from the ollama release notes, Paraphrase-Multilingual appears to be supported as well.

Let's try it.

Preparation

Pull the latest llama.cpp and rebuild it.
Update ollama to v0.3.4.

Embeddings with ollama

API reference

$ ollama --version
ollama version is 0.3.4

Try it with bge-m3.

$ ollama pull bge-m3

$ ollama ls
NAME                                 	ID          	SIZE  	MODIFIED
bge-m3:latest                        	790764642607	1.2 GB	40 seconds ago
(snip)

Ollama's native API

$ curl http://localhost:11434/api/embed -d '{
  "model": "bge-m3",
  "input": "明日の天気は"
}' | jq -r .

{
  "model": "bge-m3",
  "embeddings": [
    [
      -0.018340165,
      0.03155742,
      -0.0399797,
(snip)
      0.015442215,
      0.023472361,
      0.008206264
    ]
  ],
  "total_duration": 606037407,
  "load_duration": 1132933,
  "prompt_eval_count": 7
}
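
The same request can be sent from Python with only the standard library. This is a minimal sketch, assuming ollama is listening on its default localhost:11434 and bge-m3 has already been pulled. The helper names (build_embed_request, parse_embed_response, embed) are made up for illustration and are not part of any ollama client.

```python
import json
import urllib.request

# Default local endpoint, same as in the curl call above.
OLLAMA_URL = "http://localhost:11434/api/embed"


def build_embed_request(model, text, url=OLLAMA_URL):
    """Build a POST request carrying the same JSON body as the curl example."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embed_response(raw):
    """Return the list of embedding vectors from an /api/embed response body."""
    return json.loads(raw)["embeddings"]


def embed(model, text):
    """Send the request to a running ollama server and return its embeddings."""
    with urllib.request.urlopen(build_embed_request(model, text)) as resp:
        return parse_embed_response(resp.read())
```

With the server running, embed("bge-m3", "明日の天気は") should return a list holding one vector, matching the "embeddings" field in the response above.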

OpenAI-compatible API

$ curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "明日の天気は",
    "model": "bge-m3"
  }'

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        -0.018340165,
        0.03155742,
        -0.0399797,
(snip)
        0.015442215,
        0.023472361,
        0.008206264
      ],
      "index": 0
    }
  ],
  "model": "bge-m3",
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 7
  }
}
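
The OpenAI-style response nests each vector under data[i].embedding rather than a top-level "embeddings" list. A small sketch of pulling the vectors out in index order; the function name is hypothetical, and the sample is a shortened stand-in for the response above, not real output.

```python
import json


def embeddings_from_openai_response(raw):
    """Return embedding vectors from an OpenAI-style response, ordered by index."""
    items = json.loads(raw)["data"]
    return [item["embedding"] for item in sorted(items, key=lambda item: item["index"])]


# Shortened stand-in for the response above; the real vector is far longer.
sample = json.dumps({
    "object": "list",
    "data": [
        {"object": "embedding", "embedding": [-0.018340165, 0.03155742], "index": 0},
    ],
    "model": "bge-m3",
    "usage": {"prompt_tokens": 7, "total_tokens": 7},
})

print(embeddings_from_openai_response(sample))
```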

Embeddings with llama.cpp

Documentation

There was a tutorial here.

Assume the model has already been downloaded from the following repository.
https://huggingface.co/BAAI/bge-m3

Make sure llama.cpp has been pulled and rebuilt beforehand.

Now convert the model to GGUF.

$ python convert_hf_to_gguf.py /data2/bge-m3 --outfile models/bge-m3-f16.gguf

The bge-m3 page on ollama lists "quantization: f16", so the f16 conversion above is presumably fine?

https://ollama.com/library/bge-m3

Let's run it for now.

$ ./llama-embedding -m models/bge-m3-f16.gguf -e -p "明日の天気は" --verbose-prompt -ngl 99

embedding 0: -0.018617  0.031422 -0.039773 (snip) 0.015442  0.023361  0.008304

Let's try quantization too, to Q8_0.

$ ./llama-quantize models/bge-m3-f16.gguf models/bge-m3.q8_0.gguf q8_0

It finishes quickly. Now run the quantized model as well.

$ ./llama-embedding -m models/bge-m3.q8_0.gguf -e -p "明日の天気は" --verbose-prompt -ngl 99

embedding 0: -0.017081  0.031800 -0.039282 (snip) 0.016131  0.022680  0.008942
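
As a rough sanity check, the f16 and Q8_0 outputs can be compared with cosine similarity. The sketch below uses only the six components visible in the two outputs above (the first three and last three), so it is an illustration of the comparison and not a measurement over the full vectors.

```python
import math

# Only the six components printed above; everything in between is snipped.
f16 = [-0.018617, 0.031422, -0.039773, 0.015442, 0.023361, 0.008304]
q8_0 = [-0.017081, 0.031800, -0.039282, 0.016131, 0.022680, 0.008942]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


similarity = cosine_similarity(f16, q8_0)
print(f"cosine similarity on the visible components: {similarity:.4f}")
```

On these six components the similarity comes out above 0.99. A real comparison would need the complete vectors from both runs.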

Note that llama-server crashed with a core dump for both the f16 and the Q8_0 model.

$ ./llama-server -m models/bge-m3-f16.gguf --embeddings -c 512 -ngl 99

examples/server/server.cpp:696: GGML_ASSERT(llama_add_eos_token(model) != 1) failed
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)

This is the same situation as the following.

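Running bge-m3 under ollama with full GPU offload takes roughly 2 GB of VRAM (the ollama process itself shows 1938MiB in the output below).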

$ watch nvidia-smi

Thu Aug  8 00:51:36 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0 Off |                  Off |
|  0%   50C    P8              11W / 450W |   2148MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1168      G   /usr/lib/xorg/Xorg                          167MiB |
|    0   N/A  N/A      1410      G   /usr/bin/gnome-shell                         15MiB |
|    0   N/A  N/A    634910      C   ...unners/cuda_v11/ollama_llama_server     1938MiB |
+---------------------------------------------------------------------------------------+

$ ollama ps
NAME         	ID          	SIZE  	PROCESSOR	UNTIL
bge-m3:latest	790764642607	1.8 GB	100% GPU 	Forever

I also tried multilingual-e5-large.

$ python convert_hf_to_gguf.py /data/multilingual-e5-large --outfile models/me5-large-f16.gguf

$ ./llama-embedding -m models/me5-large-f16.gguf -e -p "明日の天気は" --verbose-prompt -ngl 99

embedding 0:  0.030236  0.008233 -0.021544 (snip)  0.016648 -0.014328 -0.002308

Looks like it works.

I can't say anything about accuracy yet, so I plan to check that separately.

For both LangChain and LlamaIndex, using the ollama integration looks like the way to go.