
Running Phi-3 mini on a local GPU with llama.cpp

Published 2024/05/01

Introduction

I tried running Microsoft's Phi-3-mini on a local PC's GPU using llama.cpp.

Here is the environment I tested on:

  • Core i9-13900
  • 64GB RAM
  • GeForce RTX 4090
  • Windows 11 Pro

llama.cpp & CUDA

Build llama.cpp so that it uses the GPU via CUDA. (The compiler paths in the logs below show a Linux toolchain, so the build was presumably done under WSL2 on the Windows machine above.)

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd $_
cmake .. -DLLAMA_CUBLAS=ON

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
CMake Warning at CMakeLists.txt:387 (message):
  LLAMA_CUBLAS is deprecated and will be removed in the future.

  Use LLAMA_CUDA instead


-- Found CUDAToolkit: /usr/local/cuda-12.2/include (found version "12.2.91")
-- CUDA found
-- The CUDA compiler identification is NVIDIA 12.2.91
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda-12.2/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using CUDA architectures: 52;61;70
-- CUDA host compiler is GNU 11.4.0

-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with LLAMA_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done (3.8s)
-- Generating done (0.7s)
-- Build files have been written to: /home/user/src/llama.cpp/build
cmake --build . --config Release

[  1%] Building C object CMakeFiles/ggml.dir/ggml.c.o
[  1%] Building C object CMakeFiles/ggml.dir/ggml-alloc.c.o
[  2%] Building C object CMakeFiles/ggml.dir/ggml-backend.c.o
[  2%] Building C object CMakeFiles/ggml.dir/ggml-quants.c.o
[  3%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/acc.cu.o
[  3%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/alibi.cu.o
[  4%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/arange.cu.o
[  4%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/argsort.cu.o
[  5%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/binbcast.cu.o
[  5%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/clamp.cu.o
[  6%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/concat.cu.o
[  6%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/convert.cu.o
[  6%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/cpy.cu.o
[  7%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/diagmask.cu.o
[  7%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/dmmv.cu.o
[  8%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/getrows.cu.o
[  8%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/im2col.cu.o
[  9%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/mmq.cu.o
[  9%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/mmvq.cu.o
[ 10%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/norm.cu.o
[ 10%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/pad.cu.o
[ 11%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/pool2d.cu.o
[ 11%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/quantize.cu.o
[ 12%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/rope.cu.o
[ 12%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/scale.cu.o
[ 13%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/softmax.cu.o
[ 13%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/sumrows.cu.o
[ 14%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/tsembd.cu.o
[ 14%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/unary.cu.o
[ 15%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda/upscale.cu.o
[ 15%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda.cu.o
[ 16%] Building CXX object CMakeFiles/ggml.dir/sgemm.cpp.o
[ 16%] Built target ggml
[ 16%] Linking CXX static library libggml_static.a
[ 16%] Built target ggml_static
[ 17%] Building CXX object CMakeFiles/llama.dir/llama.cpp.o
[ 17%] Building CXX object CMakeFiles/llama.dir/unicode.cpp.o
[ 18%] Building CXX object CMakeFiles/llama.dir/unicode-data.cpp.o
[ 18%] Linking CXX static library libllama.a
[ 18%] Built target llama
[ 18%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
[ 19%] Built target build_info
[ 20%] Building CXX object common/CMakeFiles/common.dir/common.cpp.o
[ 20%] Building CXX object common/CMakeFiles/common.dir/sampling.cpp.o
[ 21%] Building CXX object common/CMakeFiles/common.dir/console.cpp.o
[ 21%] Building CXX object common/CMakeFiles/common.dir/grammar-parser.cpp.o
[ 22%] Building CXX object common/CMakeFiles/common.dir/json-schema-to-grammar.cpp.o
[ 22%] Building CXX object common/CMakeFiles/common.dir/train.cpp.o
[ 23%] Building CXX object common/CMakeFiles/common.dir/ngram-cache.cpp.o
[ 23%] Linking CXX static library libcommon.a
[ 23%] Built target common
[ 24%] Building CXX object tests/CMakeFiles/test-quantize-fns.dir/test-quantize-fns.cpp.o
[ 24%] Building CXX object tests/CMakeFiles/test-quantize-fns.dir/get-model.cpp.o
[ 25%] Linking CXX executable ../bin/test-quantize-fns
[ 25%] Built target test-quantize-fns
[ 25%] Building CXX object tests/CMakeFiles/test-quantize-perf.dir/test-quantize-perf.cpp.o
[ 26%] Building CXX object tests/CMakeFiles/test-quantize-perf.dir/get-model.cpp.o
[ 26%] Linking CXX executable ../bin/test-quantize-perf
[ 26%] Built target test-quantize-perf
[ 26%] Building CXX object tests/CMakeFiles/test-sampling.dir/test-sampling.cpp.o
[ 27%] Building CXX object tests/CMakeFiles/test-sampling.dir/get-model.cpp.o
[ 27%] Linking CXX executable ../bin/test-sampling
[ 27%] Built target test-sampling
[ 27%] Building CXX object tests/CMakeFiles/test-chat-template.dir/test-chat-template.cpp.o
[ 28%] Building CXX object tests/CMakeFiles/test-chat-template.dir/get-model.cpp.o
[ 28%] Linking CXX executable ../bin/test-chat-template
[ 28%] Built target test-chat-template
[ 29%] Building CXX object tests/CMakeFiles/test-tokenizer-0-llama.dir/test-tokenizer-0-llama.cpp.o
[ 29%] Building CXX object tests/CMakeFiles/test-tokenizer-0-llama.dir/get-model.cpp.o
[ 30%] Linking CXX executable ../bin/test-tokenizer-0-llama
[ 30%] Built target test-tokenizer-0-llama
[ 31%] Building CXX object tests/CMakeFiles/test-tokenizer-0-falcon.dir/test-tokenizer-0-falcon.cpp.o
[ 31%] Building CXX object tests/CMakeFiles/test-tokenizer-0-falcon.dir/get-model.cpp.o
[ 31%] Linking CXX executable ../bin/test-tokenizer-0-falcon
[ 31%] Built target test-tokenizer-0-falcon
[ 32%] Building CXX object tests/CMakeFiles/test-tokenizer-1-llama.dir/test-tokenizer-1-llama.cpp.o
[ 32%] Building CXX object tests/CMakeFiles/test-tokenizer-1-llama.dir/get-model.cpp.o
[ 33%] Linking CXX executable ../bin/test-tokenizer-1-llama
[ 33%] Built target test-tokenizer-1-llama
[ 34%] Building CXX object tests/CMakeFiles/test-tokenizer-1-baichuan.dir/test-tokenizer-1-llama.cpp.o
[ 34%] Building CXX object tests/CMakeFiles/test-tokenizer-1-baichuan.dir/get-model.cpp.o
[ 35%] Linking CXX executable ../bin/test-tokenizer-1-baichuan
[ 35%] Built target test-tokenizer-1-baichuan
[ 35%] Building CXX object tests/CMakeFiles/test-tokenizer-1-falcon.dir/test-tokenizer-1-bpe.cpp.o
[ 36%] Building CXX object tests/CMakeFiles/test-tokenizer-1-falcon.dir/get-model.cpp.o
[ 36%] Linking CXX executable ../bin/test-tokenizer-1-falcon
[ 36%] Built target test-tokenizer-1-falcon
[ 36%] Building CXX object tests/CMakeFiles/test-tokenizer-1-aquila.dir/test-tokenizer-1-bpe.cpp.o
[ 37%] Building CXX object tests/CMakeFiles/test-tokenizer-1-aquila.dir/get-model.cpp.o
[ 37%] Linking CXX executable ../bin/test-tokenizer-1-aquila
[ 37%] Built target test-tokenizer-1-aquila
[ 37%] Building CXX object tests/CMakeFiles/test-tokenizer-1-mpt.dir/test-tokenizer-1-bpe.cpp.o
[ 38%] Building CXX object tests/CMakeFiles/test-tokenizer-1-mpt.dir/get-model.cpp.o
[ 38%] Linking CXX executable ../bin/test-tokenizer-1-mpt
[ 38%] Built target test-tokenizer-1-mpt
[ 38%] Building CXX object tests/CMakeFiles/test-tokenizer-1-stablelm-3b-4e1t.dir/test-tokenizer-1-bpe.cpp.o
[ 39%] Building CXX object tests/CMakeFiles/test-tokenizer-1-stablelm-3b-4e1t.dir/get-model.cpp.o
[ 39%] Linking CXX executable ../bin/test-tokenizer-1-stablelm-3b-4e1t
[ 39%] Built target test-tokenizer-1-stablelm-3b-4e1t
[ 40%] Building CXX object tests/CMakeFiles/test-tokenizer-1-gpt-neox.dir/test-tokenizer-1-bpe.cpp.o
[ 40%] Building CXX object tests/CMakeFiles/test-tokenizer-1-gpt-neox.dir/get-model.cpp.o
[ 41%] Linking CXX executable ../bin/test-tokenizer-1-gpt-neox
[ 41%] Built target test-tokenizer-1-gpt-neox
[ 42%] Building CXX object tests/CMakeFiles/test-tokenizer-1-refact.dir/test-tokenizer-1-bpe.cpp.o
[ 42%] Building CXX object tests/CMakeFiles/test-tokenizer-1-refact.dir/get-model.cpp.o
[ 43%] Linking CXX executable ../bin/test-tokenizer-1-refact
[ 43%] Built target test-tokenizer-1-refact
[ 44%] Building CXX object tests/CMakeFiles/test-tokenizer-1-starcoder.dir/test-tokenizer-1-bpe.cpp.o
[ 44%] Building CXX object tests/CMakeFiles/test-tokenizer-1-starcoder.dir/get-model.cpp.o
[ 45%] Linking CXX executable ../bin/test-tokenizer-1-starcoder
[ 45%] Built target test-tokenizer-1-starcoder
[ 45%] Building CXX object tests/CMakeFiles/test-tokenizer-1-gpt2.dir/test-tokenizer-1-bpe.cpp.o
[ 46%] Building CXX object tests/CMakeFiles/test-tokenizer-1-gpt2.dir/get-model.cpp.o
[ 46%] Linking CXX executable ../bin/test-tokenizer-1-gpt2
[ 46%] Built target test-tokenizer-1-gpt2
[ 47%] Building CXX object tests/CMakeFiles/test-grammar-parser.dir/test-grammar-parser.cpp.o
[ 47%] Building CXX object tests/CMakeFiles/test-grammar-parser.dir/get-model.cpp.o
[ 48%] Linking CXX executable ../bin/test-grammar-parser
[ 48%] Built target test-grammar-parser
[ 49%] Building CXX object tests/CMakeFiles/test-llama-grammar.dir/test-llama-grammar.cpp.o
[ 49%] Building CXX object tests/CMakeFiles/test-llama-grammar.dir/get-model.cpp.o
[ 50%] Linking CXX executable ../bin/test-llama-grammar
[ 50%] Built target test-llama-grammar
[ 50%] Building CXX object tests/CMakeFiles/test-grammar-integration.dir/test-grammar-integration.cpp.o
[ 51%] Building CXX object tests/CMakeFiles/test-grammar-integration.dir/get-model.cpp.o
[ 51%] Linking CXX executable ../bin/test-grammar-integration
[ 51%] Built target test-grammar-integration
[ 52%] Building CXX object tests/CMakeFiles/test-grad0.dir/test-grad0.cpp.o
[ 52%] Building CXX object tests/CMakeFiles/test-grad0.dir/get-model.cpp.o
[ 53%] Linking CXX executable ../bin/test-grad0
[ 53%] Built target test-grad0
[ 54%] Building CXX object tests/CMakeFiles/test-backend-ops.dir/test-backend-ops.cpp.o
[ 54%] Building CXX object tests/CMakeFiles/test-backend-ops.dir/get-model.cpp.o
[ 55%] Linking CXX executable ../bin/test-backend-ops
[ 55%] Built target test-backend-ops
[ 56%] Building CXX object tests/CMakeFiles/test-rope.dir/test-rope.cpp.o
[ 56%] Building CXX object tests/CMakeFiles/test-rope.dir/get-model.cpp.o
[ 57%] Linking CXX executable ../bin/test-rope
[ 57%] Built target test-rope
[ 57%] Building CXX object tests/CMakeFiles/test-model-load-cancel.dir/test-model-load-cancel.cpp.o
[ 58%] Building CXX object tests/CMakeFiles/test-model-load-cancel.dir/get-model.cpp.o
[ 58%] Linking CXX executable ../bin/test-model-load-cancel
[ 58%] Built target test-model-load-cancel
[ 59%] Building CXX object tests/CMakeFiles/test-autorelease.dir/test-autorelease.cpp.o
[ 59%] Building CXX object tests/CMakeFiles/test-autorelease.dir/get-model.cpp.o
[ 59%] Linking CXX executable ../bin/test-autorelease
[ 59%] Built target test-autorelease
[ 59%] Building CXX object tests/CMakeFiles/test-json-schema-to-grammar.dir/test-json-schema-to-grammar.cpp.o
[ 60%] Building CXX object tests/CMakeFiles/test-json-schema-to-grammar.dir/get-model.cpp.o
[ 60%] Linking CXX executable ../bin/test-json-schema-to-grammar
[ 60%] Built target test-json-schema-to-grammar
[ 60%] Building C object tests/CMakeFiles/test-c.dir/test-c.c.o
[ 61%] Linking CXX executable ../bin/test-c
[ 61%] Built target test-c
[ 61%] Building CXX object examples/baby-llama/CMakeFiles/baby-llama.dir/baby-llama.cpp.o
[ 61%] Linking CXX executable ../../bin/baby-llama
[ 61%] Built target baby-llama
[ 62%] Building CXX object examples/batched/CMakeFiles/batched.dir/batched.cpp.o
[ 62%] Linking CXX executable ../../bin/batched
[ 62%] Built target batched
[ 63%] Building CXX object examples/batched-bench/CMakeFiles/batched-bench.dir/batched-bench.cpp.o
[ 63%] Linking CXX executable ../../bin/batched-bench
[ 63%] Built target batched-bench
[ 64%] Building CXX object examples/beam-search/CMakeFiles/beam-search.dir/beam-search.cpp.o
[ 64%] Linking CXX executable ../../bin/beam-search
[ 64%] Built target beam-search
[ 65%] Building CXX object examples/benchmark/CMakeFiles/benchmark.dir/benchmark-matmult.cpp.o
[ 65%] Linking CXX executable ../../bin/benchmark
[ 65%] Built target benchmark
[ 66%] Building CXX object examples/convert-llama2c-to-ggml/CMakeFiles/convert-llama2c-to-ggml.dir/convert-llama2c-to-ggml.cpp.o
[ 66%] Linking CXX executable ../../bin/convert-llama2c-to-ggml
[ 66%] Built target convert-llama2c-to-ggml
[ 67%] Building CXX object examples/embedding/CMakeFiles/embedding.dir/embedding.cpp.o
[ 67%] Linking CXX executable ../../bin/embedding
[ 67%] Built target embedding
[ 68%] Building CXX object examples/eval-callback/CMakeFiles/eval-callback.dir/eval-callback.cpp.o
[ 68%] Linking CXX executable ../../bin/eval-callback
[ 68%] Built target eval-callback
[ 69%] Building CXX object examples/finetune/CMakeFiles/finetune.dir/finetune.cpp.o
[ 69%] Linking CXX executable ../../bin/finetune
[ 69%] Built target finetune
[ 70%] Building CXX object examples/gritlm/CMakeFiles/gritlm.dir/gritlm.cpp.o
[ 70%] Linking CXX executable ../../bin/gritlm
[ 70%] Built target gritlm
[ 71%] Building CXX object examples/gguf-split/CMakeFiles/gguf-split.dir/gguf-split.cpp.o
[ 71%] Linking CXX executable ../../bin/gguf-split
[ 71%] Built target gguf-split
[ 72%] Building CXX object examples/infill/CMakeFiles/infill.dir/infill.cpp.o
[ 72%] Linking CXX executable ../../bin/infill
[ 72%] Built target infill
[ 73%] Building CXX object examples/llama-bench/CMakeFiles/llama-bench.dir/llama-bench.cpp.o
[ 73%] Linking CXX executable ../../bin/llama-bench
[ 73%] Built target llama-bench
[ 74%] Building CXX object examples/llava/CMakeFiles/llava.dir/llava.cpp.o
[ 74%] Building CXX object examples/llava/CMakeFiles/llava.dir/clip.cpp.o
[ 74%] Built target llava
[ 74%] Linking CXX static library libllava_static.a
[ 74%] Built target llava_static
[ 75%] Building CXX object examples/llava/CMakeFiles/llava-cli.dir/llava-cli.cpp.o
[ 75%] Linking CXX executable ../../bin/llava-cli
[ 75%] Built target llava-cli
[ 76%] Building CXX object examples/main/CMakeFiles/main.dir/main.cpp.o
[ 76%] Linking CXX executable ../../bin/main
[ 76%] Built target main
[ 76%] Building CXX object examples/tokenize/CMakeFiles/tokenize.dir/tokenize.cpp.o
[ 77%] Linking CXX executable ../../bin/tokenize
[ 77%] Built target tokenize
[ 78%] Building CXX object examples/parallel/CMakeFiles/parallel.dir/parallel.cpp.o
[ 78%] Linking CXX executable ../../bin/parallel
[ 78%] Built target parallel
[ 79%] Building CXX object examples/perplexity/CMakeFiles/perplexity.dir/perplexity.cpp.o
[ 79%] Linking CXX executable ../../bin/perplexity
[ 79%] Built target perplexity
[ 80%] Building CXX object examples/quantize/CMakeFiles/quantize.dir/quantize.cpp.o
[ 80%] Linking CXX executable ../../bin/quantize
[ 80%] Built target quantize
[ 81%] Building CXX object examples/quantize-stats/CMakeFiles/quantize-stats.dir/quantize-stats.cpp.o
[ 81%] Linking CXX executable ../../bin/quantize-stats
[ 81%] Built target quantize-stats
[ 82%] Building CXX object examples/retrieval/CMakeFiles/retrieval.dir/retrieval.cpp.o
[ 82%] Linking CXX executable ../../bin/retrieval
[ 82%] Built target retrieval
[ 83%] Building CXX object examples/save-load-state/CMakeFiles/save-load-state.dir/save-load-state.cpp.o
[ 83%] Linking CXX executable ../../bin/save-load-state
[ 83%] Built target save-load-state
[ 84%] Building CXX object examples/simple/CMakeFiles/simple.dir/simple.cpp.o
[ 84%] Linking CXX executable ../../bin/simple
[ 84%] Built target simple
[ 85%] Building CXX object examples/passkey/CMakeFiles/passkey.dir/passkey.cpp.o
[ 85%] Linking CXX executable ../../bin/passkey
[ 85%] Built target passkey
[ 86%] Building CXX object examples/speculative/CMakeFiles/speculative.dir/speculative.cpp.o
[ 86%] Linking CXX executable ../../bin/speculative
[ 86%] Built target speculative
[ 87%] Building CXX object examples/lookahead/CMakeFiles/lookahead.dir/lookahead.cpp.o
[ 87%] Linking CXX executable ../../bin/lookahead
[ 87%] Built target lookahead
[ 88%] Building CXX object examples/lookup/CMakeFiles/lookup.dir/lookup.cpp.o
[ 88%] Linking CXX executable ../../bin/lookup
[ 88%] Built target lookup
[ 89%] Building CXX object examples/lookup/CMakeFiles/lookup-create.dir/lookup-create.cpp.o
[ 89%] Linking CXX executable ../../bin/lookup-create
[ 89%] Built target lookup-create
[ 90%] Building CXX object examples/lookup/CMakeFiles/lookup-merge.dir/lookup-merge.cpp.o
[ 90%] Linking CXX executable ../../bin/lookup-merge
[ 90%] Built target lookup-merge
[ 91%] Building CXX object examples/lookup/CMakeFiles/lookup-stats.dir/lookup-stats.cpp.o
[ 91%] Linking CXX executable ../../bin/lookup-stats
[ 91%] Built target lookup-stats
[ 92%] Building CXX object examples/gguf/CMakeFiles/gguf.dir/gguf.cpp.o
[ 92%] Linking CXX executable ../../bin/gguf
[ 92%] Built target gguf
[ 92%] Building CXX object examples/train-text-from-scratch/CMakeFiles/train-text-from-scratch.dir/train-text-from-scratch.cpp.o
[ 93%] Linking CXX executable ../../bin/train-text-from-scratch
[ 93%] Built target train-text-from-scratch
[ 94%] Building CXX object examples/imatrix/CMakeFiles/imatrix.dir/imatrix.cpp.o
[ 94%] Linking CXX executable ../../bin/imatrix
[ 94%] Built target imatrix
[ 94%] Generating json-schema-to-grammar.mjs.hpp
[ 95%] Generating completion.js.hpp
[ 96%] Generating index.html.hpp
[ 96%] Generating index.js.hpp
[ 97%] Building CXX object examples/server/CMakeFiles/server.dir/server.cpp.o
[ 97%] Linking CXX executable ../../bin/server
[ 97%] Built target server
[ 98%] Building CXX object examples/export-lora/CMakeFiles/export-lora.dir/export-lora.cpp.o
[ 98%] Linking CXX executable ../../bin/export-lora
[ 98%] Built target export-lora
[ 98%] Building CXX object pocs/vdot/CMakeFiles/vdot.dir/vdot.cpp.o
[ 99%] Linking CXX executable ../../bin/vdot
[ 99%] Built target vdot
[100%] Building CXX object pocs/vdot/CMakeFiles/q8dot.dir/q8dot.cpp.o
[100%] Linking CXX executable ../../bin/q8dot
[100%] Built target q8dot
cd ..
cp build/bin/main ./
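
As the CMake warning in the configure output notes, LLAMA_CUBLAS is deprecated in favor of LLAMA_CUDA. On a tree of this vintage, the warning-free configure step should be the following (a sketch based on that warning, otherwise identical):

# non-deprecated CUDA flag, per the CMake warning above
cmake .. -DLLAMA_CUDA=ON
cmake --build . --config Release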

Running the model
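
The GGUF files used below are the ones Microsoft publishes on Hugging Face. Assuming the microsoft/Phi-3-mini-4k-instruct-gguf repository still hosts files under these exact names, they can be fetched with huggingface-cli (a sketch, not part of the original run):

# hypothetical fetch; the repo and file names are inferred from the runs below
pip install -U "huggingface_hub[cli]"
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf Phi-3-mini-4k-instruct-q4.gguf --local-dir .
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf Phi-3-mini-4k-instruct-fp16.gguf --local-dir .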

Below is a run with the q4 model.

./main -m Phi-3-mini-4k-instruct-q4.gguf -p "<|system|>\nYou are a helpful AI assistant.<|end|>\n<|user|>\nカレーライスとは何ですか?<|end|>\n<|assistant|>"
Log start
main: build = 2755 (e00b4a8f)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1714366482
llama_model_loader: loaded meta data with 25 key-value pairs and 291 tensors from Phi-3-mini-4k-instruct-q4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32064
llama_model_loader: - kv   3:                       llama.context_length u32              = 4096
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 96
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32064]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32064]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 2.16 GiB (4.85 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  2210.78 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   192.00 MiB
llama_new_context_with_model: KV self size  =  192.00 MiB, K (f16):   96.00 MiB, V (f16):   96.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   168.75 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    13.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 356

system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1


<s><|system|>\nYou are a helpful AI assistant.<|end|>\n<|user|>\nカレーライスとは何ですか?<|end|>\n<|assistant|> カレーライスは、日本の伝統的な料理で、カレー(味噌汁の中に練り込まれた甘く熱いスープ)と、一緒に炊いたご飯(白米)を組み合わせた料理です。カレーライスは、味噌(味噌汁の調味料)を用いて作られる一般的な日本料理で、味わい深い味わいと両方の食材のバランスが楽しめます。味噌の甘さと熱さが、炊いたご飯に絶妙に調和しています。カレーライスは、家族や友人の飲み会、お祭りなどでも広く楽しまれています。

また、カレーライスにはさまざまな形で提予されています。一般的な形としては以下のようなものがあります:

1. カレーライスの基本形式: 味噌汁と炊き上げたご飯を組み合わせたもの。
2. カレーライスの様式: 軽食やデザートとして、カレーを小骨やアボカド、フレッシュチーズ、トッピングとして用いることもあります。
3. カレーライスの大型: グローバルなイベントやお正月など、大型のカレーライスを提供する場合があります。

カレーライスは、そのシンプルながらも様々な味わいを楽しむ上に、日本文化におけるコーヒー文化との調和も見て取ることができるでしょう。<|endoftext|> [end of text]

llama_print_timings:        load time =    5351.70 ms
llama_print_timings:      sample time =      13.12 ms /   579 runs   (    0.02 ms per token, 44137.83 tokens per second)
llama_print_timings: prompt eval time =    1091.29 ms /    35 tokens (   31.18 ms per token,    32.07 tokens per second)
llama_print_timings:        eval time =   28521.33 ms /   578 runs   (   49.34 ms per token,    20.27 tokens per second)
llama_print_timings:       total time =   29741.03 ms /   613 tokens
Log end
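
One caveat in the log above: "offloaded 0/33 layers to GPU". Without -ngl, main keeps every layer on the CPU, so this run used CUDA only for compute buffers. The echoed prompt also shows that the \n sequences were passed literally; the -e flag makes main interpret them as real newlines, which the Phi-3 chat template expects. A run that actually offloads the whole model would look something like this (a sketch; -ngl and -e are standard main options):

# -ngl 33 offloads all 32 transformer layers plus the output layer to the GPU
# -e interprets the \n escapes in -p as newlines
./main -m Phi-3-mini-4k-instruct-q4.gguf -ngl 33 -e -p "<|system|>\nYou are a helpful AI assistant.<|end|>\n<|user|>\nカレーライスとは何ですか?<|end|>\n<|assistant|>"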

Below is a run with the fp16 model.

./main -m Phi-3-mini-4k-instruct-fp16.gguf -p "<|system|>\nYou are a helpful AI assistant.<|end|>\n<|user|>\nカレーライスとは何ですか?<|end|>\n<|assistant|>"
Log start
main: build = 2755 (e00b4a8f)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1714366598
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from Phi-3-mini-4k-instruct-fp16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32064
llama_model_loader: - kv   3:                       llama.context_length u32              = 4096
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 96
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 1
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32064]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32064]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 7.12 GiB (16.00 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  7288.51 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   192.00 MiB
llama_new_context_with_model: KV self size  =  192.00 MiB, K (f16):   96.00 MiB, V (f16):   96.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   256.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    13.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 356

system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1


<s><|system|>\nYou are a helpful AI assistant.<|end|>\n<|user|>\nカレーライスとは何ですか?<|end|>\n<|assistant|> カレーライスは、カレーとご飯という二つの主要な材料を用いて作られる日本の料理です。カレーとは、甘くて辛いスープのことで、通常に味噌、豚肉、野菜、そして材料を使って作られます。このカレーの味わいが、ご飯に塩味を加えて食べられるときの組み合わせによって大きく変わります。


一般的に、以下のような様式でカレーライスを作ることができます:

1. 牛カレー:牛肉を使用し、よく煮込まれた味わいが楽しめます。

2. 豚カレー:豚肉を主食にしたカレーの一つで、もっと甘みが優れています。

3. 鳥カレー:鶏や鹿肉を使用し、繊細な味わいが特徴です。


カレーライスは日本の食文化の一部であり、家庭でもシンプルながら栄養価が高いメニューとして広く親しまれています。また、カレーライスの種類や材料によって、季節や地域によって異なるバリエーションが存在します。<|end|> [end of text]

llama_print_timings:        load time =     700.89 ms
llama_print_timings:      sample time =      11.02 ms /   469 runs   (    0.02 ms per token, 42570.57 tokens per second)
llama_print_timings: prompt eval time =    1329.39 ms /    35 tokens (   37.98 ms per token,    26.33 tokens per second)
llama_print_timings:        eval time =   59632.28 ms /   468 runs   (  127.42 ms per token,     7.85 tokens per second)
llama_print_timings:       total time =   61069.43 ms /   503 tokens
Log end
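
For a more systematic q4 vs. fp16 comparison than eyeballing llama_print_timings, the llama-bench binary built earlier can benchmark both models in one invocation (a sketch; by default it measures pp512 prompt processing and tg128 text generation, and -m may be repeated):

# benchmark both quantizations; -ngl 33 offloads fully to the GPU
./build/bin/llama-bench -m Phi-3-mini-4k-instruct-q4.gguf -m Phi-3-mini-4k-instruct-fp16.gguf -ngl 33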

Conclusion

Having tried a few other prompts as well, my impression is that q4 runs faster: by the eval timings above, roughly 2.6 times the fp16 speed (20.27 vs. 7.85 tokens/sec), so "about double" is in the right ballpark. Answer quality felt slightly higher with fp16. Both results were pretty much what I expected.
