🐥

Building a CPU-only local LLM server (llama.cpp)

Published on 2024/02/14

Docker

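I want to use an LLM (large language model) for my work.
But ChatGPT and Gemini cost money, so I would rather not use them.
So: let's stand up an LLM locally.

With that in mind, I set out to build a local LLM API server.

Once I started planning, though, it was hard to ignore that my server is underpowered.

The machine I currently use has:

CPU : i7-5950X
Memory : 48GB

It is an outdated relic. Performance-wise it is roughly on par with mid-range parts from a few generations back, such as the i5-10400 or Ryzen 5 3500. There is no budget, however, so I will somehow build an LLM server that runs on this machine.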

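Preliminary research

There already seem to be plenty of Python libraries for running LLMs. The well-known ones at the time of my research were:

transformers
DeepSpeed
CTranslate2
vllm
llama.cpp(llama_cpp_python)

The easiest to get started with is transformers, which lets you use many of the LLMs published on Hugging Face, an AI community, in a few lines of code.

However, while most of these libraries do support running on CPU, they basically assume a GPU. Among them, llama.cpp struck me as the one most often used in CPU-only environments.

As the name suggests, llama.cpp is a library written in C++. There is a library called llama_cpp_python that calls it from Python.

This time I use llama_cpp_python to build an LLM server with little effort.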

Program

Directory layout

I develop in a Dev Container. If you do not, you do not need .devcontainer.

.
├── .devcontainer
│   ├── Dockerfile
│   ├── devcontainer.json
│   └── docker-compose.yml
├── app
│   ├── __init__.py
│   ├── llama_cpp_switch.py
│   └── main.py
├── models
└── client.py

.devcontainer directory

Name things however you like.

devcontainer.json

{
    "name": "python-llm-api-docker",
    "service": "llm-api",
    "dockerComposeFile": "docker-compose.yml",
    "remoteUser": "vscode",
    "workspaceFolder": "/work",
    "customizations": {
      "vscode": {
        "extensions": [
          "ms-python.python"
        ]
      }
    }
}

I mapped the container to port 7777. Use whatever port you like.

docker-compose.yml

services:
  llm-api:
    container_name: 'python-llm-api-container'
    hostname: 'python-llm-api-container'
    build: .
    shm_size: '16gb'
    restart: always
    working_dir: '/work' 
    tty: true
    volumes:
      - type: bind
        source: ..
        target: /work
    ports:
      - 7777:7777

The image is based on the Python slim image.
llama_cpp_python ships a sample Dockerfile, which I used as a reference.
https://github.com/abetlen/llama-cpp-python/blob/main/docker/openblas_simple/Dockerfile

Using OpenBLAS looked faster, so llama_cpp_python is installed with pip with the OpenBLAS options enabled.
FastAPI is used to turn it into an API server.

Dockerfile

FROM python:3.10-slim

ARG USERNAME=vscode
ARG USER_UID=1000
ARG USER_GID=$USER_UID

ENV LANG ja_JP.UTF-8
ENV LANGUAGE ja_JP:ja
ENV LC_ALL ja_JP.UTF-8
ENV TZ JST-9
ENV TERM xterm

RUN apt-get update \
    && groupadd --gid $USER_GID $USERNAME \
    && useradd -s /bin/bash --uid $USER_UID --gid $USER_GID -m $USERNAME \
    && apt-get install -y sudo \
    && echo $USERNAME ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/$USERNAME \
    && chmod 0440 /etc/sudoers.d/$USERNAME \
    && apt-get -y install locales \
    && localedef -f UTF-8 -i ja_JP ja_JP.UTF-8

RUN apt update && apt install -y libopenblas-dev ninja-build build-essential pkg-config
RUN python -m pip install --upgrade pip pytest cmake scikit-build setuptools sse-starlette pydantic-settings starlette-context

RUN apt-get -y install git

RUN pip install -U pip &&\
 pip install --no-cache-dir fastapi uvicorn requests

RUN CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama_cpp_python --verbose

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "7777", "--reload"]

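Note: I tried the original llama.cpp and found it was not needed, so this article does not cover it.

app directory

__init__.py is an empty file. It is a marker that tells Python to treat the directory as a module.

__init__.py

main.py is the FastAPI implementation. The shape of the JSON it accepts is defined as Info.
The fields of Info are as follows.

llm_type : File name of a model in the models directory, without the extension.
llm_input_type : The kind of text placed before the LLM input. Feeding in the raw text tends to produce odd answers, so frequently used preambles are added to llama_cpp_switch and selected by this value.
text : The input text for the LLM.
max_tokens : The maximum number of tokens the LLM returns.
end_texts : When the LLM outputs any of these strings, generation stops. Typically a newline ('\n') or similar.

Basically it just receives the control parameters and passes them on to llama_cpp_switch.
The return value of llama_cpp_switch is the text the LLM produced, so it is returned as-is from the API.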
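
Because llm_type arrives in the request body and is later interpolated into a file path, it may be worth validating it before loading anything. The following is a minimal sketch of one way to do that, not part of the original code: resolve_model_path is a hypothetical helper, and the temporary directory merely stands in for the models directory.

```python
import tempfile
from pathlib import Path


def resolve_model_path(llm_type: str, models_dir: str = "models") -> Path:
    """Map llm_type to <models_dir>/<llm_type>.gguf, rejecting anything else."""
    base = Path(models_dir).resolve()
    candidate = (base / f"{llm_type}.gguf").resolve()
    # Reject names that escape models_dir (e.g. "../outside") or do not exist.
    if base not in candidate.parents or not candidate.is_file():
        raise ValueError(f"unknown model: {llm_type!r}")
    return candidate


# Demo with a throwaway directory standing in for models/.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "phi-2.Q4_K_M.gguf").write_bytes(b"")
    found_name = resolve_model_path("phi-2.Q4_K_M", tmp).name
    try:
        resolve_model_path("../outside", tmp)
        rejected = False
    except ValueError:
        rejected = True

print(found_name, rejected)
```

In the API, a ValueError raised here could be turned into an error response instead of letting the model loader fail on a missing file.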

main.py

from fastapi import FastAPI
from pydantic import BaseModel

import app.llama_cpp_switch as llama_cpp_switch


class Info(BaseModel):
    """JSON body accepted by /api/llm."""
    llm_type: str
    llm_input_type: str
    text: str
    max_tokens: int
    end_texts: list[str]


app = FastAPI()


@app.post('/api/llm')
async def llm(info: Info):
    ret_text = ''
    # Choose how the prompt is built based on llm_input_type.
    # end_texts is passed through because both functions require it.
    if info.llm_input_type == 'ai_assistant':
        ret_text = llama_cpp_switch.call_llm_system_user_assistant(
            info.llm_type, info.text, info.max_tokens, info.end_texts)
    elif info.llm_input_type == 'raw':
        ret_text = llama_cpp_switch.call_llm(
            info.llm_type, info.text, info.max_tokens, info.end_texts)
    return {'text': ret_text}

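llama_cpp_switch.py is the script that calls llama_cpp_python.

Llama() loads the model, so if you know in advance that only a single model will ever be loaded, it is probably better to call llm = Llama('model path') outside the functions.

In this article I want to try two models, so the model is loaded each time a function is called.

I prepared two functions, and the plan is to add more depending on the job given to the LLM. For testing, there is one that feeds in the raw text and one that feeds the text in a chat-like User/Assistant format.

As you will see when you run it, with raw input the model often does not answer a question and instead outputs some arbitrary text.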

llama_cpp_switch.py

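from llama_cpp import Llama


def call_llm(llm_type: str, text: str, max_tokens: int, end_texts: list) -> str:
    """Feed the raw text to the model and return what it generates."""
    # The model is loaded on every call so that different models can be tried.
    llm = Llama(f'models/{llm_type}.gguf')
    # echo=True means the returned text also contains the prompt itself.
    output = llm(text, max_tokens=max_tokens, stop=end_texts, echo=True)
    return output['choices'][0]['text']


def call_llm_system_user_assistant(llm_type: str, text: str, max_tokens: int, end_texts: list) -> str:
    """Wrap the text in a system/user/assistant chat template before generating."""
    llm = Llama(f'models/{llm_type}.gguf')
    prompt = (
        '<|system|>\n'
        'You are an excellent AI assistant.\n'
        '<|user|>\n'
        f'{text}\n'
        '<|assistant|>\n'
    )
    # Also stop at the chat tags; build a new list instead of mutating the caller's.
    stop = end_texts + ['<|system|>', '<|user|>', '<|assistant|>']
    output = llm(prompt, stop=stop, max_tokens=max_tokens)
    return output['choices'][0]['text']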

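models directory

This is where model files in GGUF format go.
The original llama.cpp recommends converting other model formats to GGUF, but that is a hassle, so I think the easy route is to download a GGUF file directly and put it here.

In general, searching Google for "<model name> gguf" turns up people who publish GGUF files on Hugging Face and elsewhere (TheBloke, for example), so I get the files from there. If you plan commercial use, check the license of the original model.

This time I tried two models that, from what I found, are lightweight and said to score well.

phi-2
- A 2.7B-parameter model
- official : https://huggingface.co/microsoft/phi-2
- TheBloke(GGUF) : https://huggingface.co/TheBloke/phi-2-GGUF/tree/main
TinyLlama
- A very small model with 1.1B parameters
- official : https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
- TheBloke(GGUF) : https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tree/main

Both are lightweight models advertised as being on par with 7B to 14B models. With a model of 3B parameters or fewer, CPU execution on the machine described above should give acceptable response times.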

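On the TheBloke pages above you can choose from several quantized models.
You will find models labeled Q2 to Q8, along with letters such as K, S, and M.

Broadly, Q2 to Q8 indicate the number of bits used for quantization: the smaller the number, the smaller the model and the lower its accuracy.
Machine learning models are trained with 32-bit to 16-bit floats, and Q2 to Q8 are versions of those models with the bit count reduced further. Naturally, fewer bits make the model lighter and faster to compute.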
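
As a back-of-the-envelope illustration of why fewer bits mean a lighter model, the weight size can be estimated as parameters × bits per weight ÷ 8. The sketch below is only that rough arithmetic (approx_size_gb is a name made up for this example); actual GGUF files will differ somewhat, since quantized formats store some extra data alongside the weights.

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough weight size in GB: parameters x bits per weight / 8 bits per byte."""
    return n_params * bits_per_weight / 8 / 1e9


# Parameter counts are the nominal 2.7B and 1.1B quoted for the two models.
for name, n_params in [("phi-2", 2.7e9), ("TinyLlama", 1.1e9)]:
    for bits in (16, 8, 4, 2):
        print(f"{name}: {bits:>2} bit ~ {approx_size_gb(n_params, bits):.2f} GB")
```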

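Several articles online said that even Q4 is acceptable, so this time I choose a Q4 quantized model.

The letters K, S, and M indicate the quantization method.
K marks the newer k-quant method, and the suffixes in K_S, K_M, K_L are small, medium, and large variants: model size and accuracy both increase in that order. There are also models without K (Q4_0.gguf, for example), but these apparently use a legacy quantization method and seem not to be recommended.

Based on this, I decided to download Q4_K_M.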
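
To make the naming scheme concrete, here is a small sketch that picks apart the quantization suffix of a model file name. parse_quant and its regular expression are assumptions written for this illustration and only cover the name patterns mentioned in this article (Q4_K_M, Q2_K, Q4_0 and the like).

```python
import re

# ".Q<bits>_<K or legacy digit>" optionally followed by "_S", "_M" or "_L".
QUANT_RE = re.compile(r"\.Q(\d)_(K|\d)(?:_([SML]))?$")


def parse_quant(name: str):
    """Return the quantization details encoded in a model file name, or None."""
    match = QUANT_RE.search(name.removesuffix(".gguf"))
    if match is None:
        return None
    bits, kind, variant = match.groups()
    return {"bits": int(bits), "k_quant": kind == "K", "variant": variant}


print(parse_quant("phi-2.Q4_K_M.gguf"))
print(parse_quant("tinyllama-1.1b-chat-v1.0.Q4_0"))
```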

Something like TinyLlama's Q2_K_S is about 500MB, which is small enough to run in a smartphone app or a game engine. I would like to test that at some point.

client.py

This is a sample program for testing.
llm_type is the file name of a model in models, so it is one of:

'tinyllama-1.1b-chat-v1.0.Q4_K_M'
'phi-2.Q4_K_M'

Replace localhost in url with the IP address of your server.

client.py

import json
import requests
import pprint
url = 'http://localhost:7777/api/llm'

#llm_type = 'tinyllama-1.1b-chat-v1.0.Q4_K_M'
llm_type = 'phi-2.Q4_K_M'

test_info = {
    'llm_type':llm_type,
    'llm_input_type':'ai_assistant',
    'text':'What Japanese movies do you recommend that make you cry?',
    'max_tokens':100,
    'end_texts':['\n']
}

print('post text')
pprint.pprint(test_info, sort_dicts=False)
ret = requests.post(url, json.dumps(test_info))
ret_info = json.loads(ret.text)

print('respons text')
pprint.pprint(ret_info, sort_dicts=False)

If it succeeds, you get output like the following (as long as no error comes back, it worked).

post text
{'llm_type': 'phi-2.Q4_K_M',
 'llm_input_type': 'ai_assistant',
 'text': 'What Japanese movies do you recommend that make you cry?',
 'max_tokens': 100,
 'end_texts': ['\n']}
respons text
{'text': 'I can suggest several tear-jerker movies from various genres, such '
         'as "Akira" (a science fiction action movie), "The Tale of the Bamboo '
         'Cutter" (an animated fantasy movie), and "Toshiba Cinema 7th '
         'Anniversary Special" (an anime drama movie). Would you like more '
         'information on any of these movies? '}

The reply looks plausible. However, while AKIRA and The Tale of the Bamboo Cutter are real films, "Toshiba Cinema 7th Anniversary Special" is not an existing title, so the model is simply making things up.

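Verification

Effectiveness of the prompt preamble

In client.py above, llm_input_type was set to ai_assistant.
This is the branch condition inside llama_cpp_switch.py; for ai_assistant, the following text is placed before the input text.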

prompt = (
    '<|system|>\n'
    'You are an excellent AI assistant.\n'
    '<|user|>\n'
    f'{text}\n'
    '<|assistant|>\n'
)

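In other words, the LLM is made to produce the continuation of the text with this preamble attached.
LLMs are trained to "output the continuation of the input text", so the input is extended to fit that form.
Let's check how effective this is. In fact, many of the LLM server examples online do not add this kind of instruction text.

(The tests are mostly done with phi-2, which has more parameters.)

Without the preamble (`llm_input_type='raw'`)

I ask several different questions, each multiple times.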

Q : What Japanese movies do you recommend that make you cry?
- A1 : I'm in the mood for something emotional and touching
- A2 : A: There are many Japanese movies that can bring tears to your eyes, but one of my favorites is "Norwegian Wood." It's a beautiful love story set in the 1960s.
- A3 : by: Ravi Iyer
  Posted: Aug 23, 2020 / 09:44 AM EDT / Updated: Aug 25, 2020 / 01:32 PM EDT
  SUMITO KANAWA/AFP via Getty Images
  (CNN) — For many Westerners, watching Japanese films is a fairly simple thing. You might get a bit of the language and a few cultural tidbits thrown in for good measure.
  But if you really want to feel like you're experiencing the full

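A2 gave the kind of reply I was hoping for, but A1 and A3 do not answer the question.
A1 returns a continuation of the asker's own sentence, and A3 treats the question as if it were a title and writes the body text that would follow it.

Next question.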

Q : 1+1=
- A1 : 2, right?
- A2 : 3
- A3 : 2, not 3. This is an extremely common misconception about the way that the brain works and how we learn new things. The reason it is so pervasive is because there is a tendency to believe that in order to remember something, you have to

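Here the model answered 2 with high probability. Still, my impression is that it also often answers 3 or tacks a long passage on after the answer. The unneeded text after the answer can be controlled with end_texts.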
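
Conceptually, a stop string such as those in end_texts cuts the output at the first place any of the strings appears. The following pure-Python sketch applies that idea to already generated text; it only illustrates the behaviour and is not how llama.cpp implements it internally. truncate_at_stop is a hypothetical helper.

```python
def truncate_at_stop(text: str, stop: list) -> str:
    """Return text cut just before the earliest occurrence of any stop string."""
    cut = len(text)
    for s in stop:
        if not s:
            continue  # an empty stop string would match at position 0
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# A raw answer shaped like A3 above: the answer followed by unwanted text.
raw_answer = "2, not 3.\nThis is an extremely common misconception"
print(truncate_at_stop(raw_answer, ["\n"]))
```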

Next question. This one asks the model to write a program; because some characters collide with Zenn's syntax, it is shown in a different format.

Q : Python code for the bubble algorithm:
A1 :

def bubble_sort(arr):
    n = len(arr)

    # Traverse through all array elements
    for i in range(n - 1):
        # Last i elements are already in place
        for j in range(0, n - i - 1):
            # traverse the array from 0 to n-i-1
            # Swap if the element found is greater
            # than the next element
            if arr[j] > arr[j + 1] :
                arr[j], arr[j + 1] = arr[j + 1], arr[j]

    return arr
Test the function with an example list of numbers
example_list = [5, 2, 8, 3, 7, 1, 4, 6, 9]
print("Sorted list using Bubble sort algorithm:", bubble_sort(example_list))
Output: Sorted list using Bubble sort algorithm: [1, 2, 3, 4, 5, 6, 7, 8, 9]
The output shows that the list of numbers is sorted in ascending order using the Bubble sort algorithm.

A2 :

# Bubble Algorithm to sort a list of numbers in ascending order
def bubble_sort(list):
  # Loop through the list n-1 times, where n is the length of the list
  for i in range(len(list) - 1):
    # Compare each pair of adjacent elements and swap them if they are out of order
    for j in range(len(list) - 1 - i):
      if list[j] > list[j + 1]:
        list[j], list[j + 1] = list[j + 1], list[j]
  # Return the sorted list
  return list

# Example input and output
list = [5, 2, 7, 9, 4]
print(bubble_sort(list))
Output: [2, 4, 5, 7, 9]

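Not bad. Standard algorithms and programs have presumably been learned quite thoroughly, so the model can return them properly.

With the preamble (`llm_input_type='ai_assistant'`)

Next, I ask the questions with the preamble attached.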

Q : What Japanese movies do you recommend that make you cry?
- A1 : Oh yes, there are many such movies in Japan! Here is a list of 5 Japanese movies that can make you cry:
  1. "Spirited Away" (2001)
  2. "Aoi Yori Aoshi" (2003)
  3. "Howl's Moving Castle" (2005)
  4. "Departures" (2013)
  5. "The Tale of the Princess Kaguya" (2007)
  These are just a few, but they
- A2 : I would recommend the following list of films: "Lost in Translation", "The Silent Voice", "The Last Samurai", and "The Tale of the Stone Flower". These movies are known for their emotional performances and beautiful cinematography. Would you like me to suggest more?
- A3 : That's a great question! There are many movies that can definitely make you feel emotional. Some of my personal favorites include "Spirited Away," "My Neighbor Totoro," "Pocahontas," and "Kiss the Rain Goodbye." These movies have all been known to bring out strong emotions in viewers, and they're sure to leave a lasting impression. Of course, it really depends on your own taste and preferences, so I would suggest browsing through the list and seeing if any

The replies are quite plausible. Since a conversational preamble is attached, you can also see that the answers follow that format. (The text ends midway because max_tokens=100.)

Next is 1+1.

Q : 1+1=
- A1 : 2! That's easy, the sum of 1 and 1 is 2. But if you're looking for something more challenging, let me try again. What is 7 times 8?
- A2 : That's correct, 1+1 is equal to 2.
- A3 : Sure, 1 plus 1 equals 2. Do you have any other questions or tasks that I can assist you with?
  Assistant: Sure! Please feel free to ask me anything. Let me know how I can help you today.

With the preamble, the replies read more like a conversation.

Next, I have the model write a program.

Q : Python code for the bubble algorithm:
A1 : Great question, user. Here is a Python implementation of the Bubble Sort algorithm:

def bubble_sort(arr):
    n = len(arr)

    # Traverse through all array elements
    for i in range(n):

        # Last i elements are already in place
        for j in range(0, n-i-1):

            # Swap if the element found is greater
            # than the next element
            if arr[j] > arr[j+1] :
                arr[j],
                arr[j+1] = arr[j+1],

A2 : Yes, that is correct. The bubble sort algorithm works by repeatedly swapping adjacent elements if they are in the wrong order. It does this until no more swaps are required, indicating that the list is sorted.
However, there is a more efficient sorting algorithm called quicksort, which has a time complexity of O(n log n) compared to O(n^2) for bubble sort. Would you like to learn more about quicksort?

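A1 was a good reply, but A2 turned into a conversation. That is because the preamble is conversational.

Summary

The preamble should be switched depending on what you want the LLM to produce.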
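
One way to make the preamble switchable is to keep the templates in a table keyed by llm_input_type. This is a minimal sketch under assumptions: PROMPT_TEMPLATES and build_prompt are names invented here, and the 'code_only' entry is a hypothetical extra template added purely to show how a new task could be registered.

```python
CHAT_TEMPLATE = (
    "<|system|>\n"
    "{system}\n"
    "<|user|>\n"
    "{text}\n"
    "<|assistant|>\n"
)

# One template per llm_input_type; 'code_only' is a hypothetical addition.
PROMPT_TEMPLATES = {
    "raw": "{text}",
    "ai_assistant": CHAT_TEMPLATE.replace("{system}", "You are an excellent AI assistant."),
    "code_only": CHAT_TEMPLATE.replace("{system}", "Reply with code only, without explanation."),
}


def build_prompt(llm_input_type: str, text: str) -> str:
    """Insert the user text into the template selected by llm_input_type."""
    template = PROMPT_TEMPLATES[llm_input_type]  # KeyError for unknown types
    # str.replace is used instead of str.format so braces in text are left alone.
    return template.replace("{text}", text)


print(build_prompt("ai_assistant", "1+1="))
```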

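Response speed

The functions in this article's llama_cpp_switch.py both load the model and run the input/output.
If the LLM instance is created in advance as a global variable, however, the model no longer has to be loaded on every call.

Let's look at how much model loading and model input/output each contribute to the execution time.
The test code is as follows.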

import json
import time

import requests

url = 'http://localhost:7777/api/llm'
try_num = 10
for max_tokens in [50, 100, 250, 500, 1000]:
    start_time = time.time()
    string_num = 0
    for _ in range(try_num):
        test_info = {
            'llm_type':'phi-2.Q4_K_M',
            'llm_input_type':'ai_assistant',
            #'llm_input_type':'raw',
            'text':'What Japanese movies do you recommend that make you cry?',
            'max_tokens':max_tokens,
            'end_texts':[]
        }
        ret = requests.post(url, json.dumps(test_info))
        ret_info = json.loads(ret.text)
        # Accumulate the number of characters returned.
        string_num += len(ret_info['text'])

    end_time = time.time()
    # Average over try_num requests.
    average_time = (end_time - start_time) / try_num
    average_string_num = string_num / try_num
    print(f'max_tokens = {max_tokens}, average time = {average_time}, average string num = {average_string_num}')

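Model loading + model input/output

Execution time depends on the number of tokens generated, so I used max_tokens of 50, 100, 250, 500, and 1000. Each figure is the average of 10 runs.
Things like newline codes are counted as two characters, but I also report the average number of characters actually returned and the number of characters output per second.

phi-2

max_tokens	Execution time	Characters	Characters per second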
50	10.3s	205.8	20.0
100	12.4s	373.1	30.1
250	16.5s	655.7	39.7
500	14.4s	546.9	38.0
1000	19.1s	822.3	43.1

tinyllama

max_tokens	Execution time	Characters	Characters per second
50	4.7s	160.8	34.2
100	6.0s	371.0	61.8
250	9.3s	851.1	91.5
500	13.2s	1437.8	108.9
1000	12.9s	1451.4	112.5

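There is a fair amount of measurement error. Going by model file size alone, phi-2 and tinyllama differ by more than a factor of two, so it is reasonable for the execution time to differ by about as much.
Also, tinyllama can output more than 100 characters per second, so it looks fine for practical use even when used fairly roughly.

Model input/output only

Next, the LLM instance is created as a global variable in llama_cpp_switch.py, and the functions that are called only perform the input/output.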
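
The modified llama_cpp_switch.py is not shown in the article. As an illustration of the idea only, the sketch below caches loaded models so that each path is loaded once; make_cached_loader is a hypothetical helper, and fake_load is a stand-in so the example runs without llama_cpp installed. In the real server, the loader would be something like lambda path: Llama(path).

```python
from functools import lru_cache
from typing import Callable


def make_cached_loader(load: Callable[[str], object], maxsize: int = 2):
    """Wrap a model-loading function so each path is loaded only once."""
    @lru_cache(maxsize=maxsize)
    def cached(path: str):
        return load(path)
    return cached


# Stand-in loader that records how often it is really invoked.
load_calls = []

def fake_load(path: str):
    load_calls.append(path)
    return {"path": path}


get_model = make_cached_loader(fake_load)
first = get_model("models/phi-2.Q4_K_M.gguf")
second = get_model("models/phi-2.Q4_K_M.gguf")  # served from the cache
print(first is second, len(load_calls))
```

Compared with a single global instance, a small cache like this would still allow switching between a couple of models at the cost of keeping each of them in memory.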

phi-2

max_tokens	Execution time	Characters	Characters per second
50	3.5s	215.8	61.7
100	4.3s	321.3	74.7
250	9.4s	690.1	73.4
500	8.4s	618.5	73.6
1000	10.6s	748.7	70.6

tinyllama

max_tokens	Execution time	Characters	Characters per second
50	1.5s	165.8	110.5
100	2.4s	322.8	134.5
250	5.8s	727.8	125.5
500	9.3s	1356.3	145.8
1000	8.5s	1228.3	144.5

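As expected, loading was eating up a good share of the processing time. Even with phi-2, once the model is loaded the processing itself finishes in a few seconds.
tinyllama also got much faster. It consistently generates more than 100 characters per second, so if the replies are plausible enough, it is quite practical.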
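
The cost of loading can be estimated roughly by subtracting the two sets of timings. The sketch below does that with the figures copied from the tables above; it is only an approximation, because the number of characters generated differed between the two measurement runs.

```python
# (max_tokens, seconds with loading, seconds without loading), from the tables above.
phi2 = [(50, 10.3, 3.5), (100, 12.4, 4.3), (250, 16.5, 9.4), (500, 14.4, 8.4), (1000, 19.1, 10.6)]
tinyllama = [(50, 4.7, 1.5), (100, 6.0, 2.4), (250, 9.3, 5.8), (500, 13.2, 9.3), (1000, 12.9, 8.5)]


def mean_load_overhead(rows):
    """Average difference between the timings with and without model loading."""
    diffs = [with_load - without_load for _, with_load, without_load in rows]
    return sum(diffs) / len(diffs)


phi2_overhead = mean_load_overhead(phi2)
tinyllama_overhead = mean_load_overhead(tinyllama)
print(f"phi-2: about {phi2_overhead:.1f}s per request spent on loading")
print(f"tinyllama: about {tinyllama_overhead:.1f}s per request spent on loading")
```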

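Summary

Loading the model in advance makes things considerably faster.
To do so, the model is loaded when the container starts, which means the model cannot be changed while the container is running. That is not a problem in practice, so from here on the model is loaded at startup.

Also, for tasks such as choosing among options, a small max_tokens is enough, so there is a good chance of real-time replies even when embedded in a game.

Comparing phi-2 and tinyllama

The previous section confirmed that phi-2 and tinyllama run at practical speeds.

So far I have confirmed that they return plausible text, but not how accurate the replies are.
So I compare which of the two returns more appropriate text.

The test code is as follows.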

import json
import requests
url = 'http://localhost:7777/api/llm'

test_info = {
    'llm_type':'pre_load',
    'llm_input_type':'ai_assistant',
    'text':'Write a Python function that prints the next 20 leap years. Reply with only the function.',
    'max_tokens':500,
    'end_texts':[]
}

print(f'Q :{test_info["text"]}')

try_num = 10
for _ in range(try_num):
    ret = requests.post(url, json.dumps(test_info))
    ret_info = json.loads(ret.text)
    print(f'A :{ret_info["text"]}')

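The questions are taken from LLMonitor Benchmarks.

Each of the following questions is asked 10 times and the rate of correct answers is measured. Whether an answer was correct was judged subjectively, after running it through Google Translate.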

Q1 : Write a Python function that prints the next 20 leap years. Reply with only the function.
Q2 : Write a Python function to find the nth number in the Fibonacci Sequence. Reply with the asked function and nothing else.
Q3 : Extract the name of the vendor from the invoice: PURCHASE #0521 NIKE XXX3846. Reply with only the name.
Q4 : What are the 5 planets closest to the sun? Reply with only a valid JSON array of objects formatted like this:

[{
  "planet": string,
  "distanceFromEarth": number,
  "diameter": number,
  "moons": number
}]

Q5 : Tell a joke about going on vacation.
Q6 : Write me a product description for a 100W wireless fast charger, for my website, in less than 100 characters.
Q7 : I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. I also gave 3 bananas to my brother. How many apples did I remain with?
Let's think step by step.
Q8 : Reply with only the following text without grammar errors and misspellings: "De zuper large elefant jumpped ovre the lazzie sheepp"
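
The scoring in this article was done by hand. For questions with a single right answer, such as Q3 and Q7, part of it could in principle be automated. The following is only a sketch of that idea: accuracy and q3_is_correct are hypothetical helpers, and the sample replies are invented for illustration, not real model output.

```python
def accuracy(answers, is_correct) -> float:
    """Percentage of answers accepted by the is_correct predicate."""
    return 100 * sum(1 for a in answers if is_correct(a)) / len(answers)


def q3_is_correct(answer: str) -> bool:
    # Q3 asks for the vendor name only, so accept "NIKE" modulo case and punctuation.
    return answer.strip().strip('."\'').upper() == "NIKE"


# Q7 counts apples only; the bananas are a distractor.
q7_expected = 10 - 2 - 2 + 5 - 1

# Hypothetical replies, invented for this example.
sample_q3 = ["NIKE", "Nike.", "The vendor is NIKE XXX3846", "nike"]
q3_score = accuracy(sample_q3, q3_is_correct)
print(q3_score, q7_expected)
```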

Results

	phi-2	tinyllama
Q1	40%	0%
Q2	100%	30%
Q3	90%	30%
Q4	70%	10%
Q5	30%	10%
Q6	0%	0%
Q7	10%	10%
Q8	0%	20%

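After running this several times, it is clear that phi-2 performs better.
Even on questions where the scores came out the same, my impression is that phi-2 gives plausible answers while tinyllama more often says something off the mark.

Q8 was the one question where phi-2 scored below tinyllama: it never got it right. It tended to delete "zuper" rather than correct it to "super".

On Q6, neither phi-2 nor tinyllama kept to the limit of 100 characters, but phi-2 mostly returned 150 to 300 characters, whereas tinyllama often went over 1000.

tinyllama is nearly twice as fast, but in terms of task accuracy phi-2 gives the better results.

That said, even when tinyllama is wrong its answers still look plausible, so I think it has its uses.

Japanese support

Finally, let's look at Japanese input and output.

I set llm_input_type to 'raw' and feed in a Japanese prompt.
The test code is as follows.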

import json
import requests
import pprint
url = 'http://localhost:7777/api/llm'

prompt = (
    '<|システム|>\n'
    'あなたは優秀な AI アシスタントです\n'
    '<|ユーザー|>\n'
    '日本で一番高い山はなんですか？\n'
    '<|アシスタント|>\n'
)

test_info = {
    'llm_type':'pre_load',
    'llm_input_type':'raw',
    'text':prompt,
    'max_tokens':100,
    'end_texts':['\n']
}

print('post text')
pprint.pprint(test_info, sort_dicts=False)
ret = requests.post(url, json.dumps(test_info))
ret_info = json.loads(ret.text)

print('respons text')
pprint.pprint(ret_info, sort_dicts=False)

Results

Q1 : 日本で一番高い山はなんですか？
- phi-2 : お客家への偽指標の記載を提供しましょう。\n<|投税済み拡大財団・全国政府|推論検討案 | カレイジャード | 財
- tinyllama :日本は积極的に稀生した山地で、高さは極めて小さい。山についての知識も総合的に並ぶことに注意
Q2 : 請求書からベンダー名を抽出してください: PURCHASE #0521 NIKE XXX3846。名前のみで返信してください。
- phi-2 : AI is changing the way we live and work, and it's having a profound impact on the world around us. In this article, we'll explore the ways in which AI is transforming various industries, from healthcare to finance to manufacturing. We'll also look at some of the challenges that come with implementing AI, and how we can work to ensure that this powerful technology is used in a responsible and ethical way. Finally, we'll look at the future of AI, and what it might hold for us
- tinyllama : 請求書のページ番号と番号が正しく表示されると、PURCHASE #0521 NIKE XXX3846 という番号があります。この番号を該当するページに進み、BENDA #0521 NIKE XXX3846、A1, D1、D2、D
Q3 : 休暇に行くことについての冗談をいってみて
- phi-2 : AI チェックリストを探しています
- tinyllama : 最近、お仕事にお金が流れてきたようだ。もしお金が足りずとも、お仕事が進まなかったと思う。そういうことは、もしからはそんなにも続きが無いから。それらの状況の中で、どうしてもそういうこと
Q4 : 私のウェブサイト用の 100W ワイヤレス急速充電器の製品説明を 100 文字以内で書いてください。
- phi-2 : ボートウェア | コンクニクス | サービス | コーポレート | 資料 | 検討 | 改善 | 清洗 | 設定 | にみません | そして資料読む | 改訂情報 | にみ
- tinyllama :申し訳ござみいません。こちらの記事を読むと、ご理解をお願い致します。この記事で、我々は役に立てるようにしております。そこで、今回は、最新の 100W ワイヤレス急
Q5 : 泣けるおすすめの日本映画は何ですか？
- phi-2 : 毎日、日本の映画音源として、柱輩を持つ影像を生成します。
- tinyllama : いくつかの映画が沿静的に呓わる映像です。これらをいくつかご紹介します。レド・カード (2018)、リベルヌ・スクエア・レイズ (2017)、ディ・フライ (2

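Looking at the results, phi-2 could not respond properly at all. Even after asking the same question several times, it never returned proper Japanese, and it often replied in English.
phi-2 probably used almost no Japanese training data.

tinyllama, by contrast, is often wrong but replies at a level that is recognizable as Japanese. With a bit more tuning, Japanese support looks within reach. This is presumably inherited from Llama, the model it is based on, which can also reply in Japanese.

Overall assessment

I compared phi-2 and tinyllama, both as Q4_K_M quantized models. The results are summarized below.

Execution speed
- tinyllama is nearly twice as fast, but phi-2 also runs at a practical speed
Accuracy
- phi-2 is better. tinyllama can also handle simple responses
Japanese
- tinyllama can respond in Japanese, though the accuracy is mediocre. phi-2 cannot respond in Japanese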
