<h1 id="%E3%81%AF%E3%81%98%E3%82%81%E3%81%AB">
<a class="header-anchor-link" href="#%E3%81%AF%E3%81%98%E3%82%81%E3%81%AB" aria-hidden="true"></a> Introduction</h1>
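At the end of 2023, ELYZA released the "ELYZA-japanese-Llama-2-13b" series, which it describes as "the highest-performing of the existing open Japanese LLMs". 
I ran it on an M1 Mac (MacBook Air M1, 16GB) using AirLLM, a library that advertises running inference for even a 70B model on a 4GB GPU card.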
<h1 id="elyza-japanese-llama-2-13b%E3%82%B7%E3%83%AA%E3%83%BC%E3%82%BA">
<a class="header-anchor-link" href="#elyza-japanese-llama-2-13b%E3%82%B7%E3%83%AA%E3%83%BC%E3%82%BA" aria-hidden="true"></a> The ELYZA-japanese-Llama-2-13b series</h1>
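This is a family of models built on Meta's "Llama 2" series with additional training on Japanese text. 
The license follows the Llama 2 Community License, so the models can be used for research and commercial purposes as long as the Acceptable Use Policy is followed.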
<ul>
<li><a href="https://note.com/elyza/n/n5d42686b60b7" target="_blank" rel="nofollow noopener noreferrer">We have released "ELYZA-japanese-Llama-2-13b", a Japanese LLM based on the 13-billion-parameter "Llama 2" (in Japanese)</a></li>
</ul>
From the series, I used the model that adds Japanese vocabulary, makes the tokenizer more efficient, and has been further trained to follow user instructions.
<ul>
<li><a href="https://huggingface.co/elyza/ELYZA-japanese-Llama-2-13b-fast-instruct" target="_blank" rel="nofollow noopener noreferrer">ELYZA-japanese-Llama-2-13b-fast-instruct</a></li>
</ul>
<h1 id="airllm">
<a class="header-anchor-link" href="#airllm" aria-hidden="true"></a> AirLLM</h1>
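AirLLM is a library that makes it possible to run very large LLMs on GPUs with little memory. Rather than loading the whole model into memory, it appears to split the model by layer and load each layer from storage into GPU memory only when it is being computed, which keeps GPU memory usage low (at the cost of overhead, so inference takes longer).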
<ul>
<li><a href="https://ai.gopubby.com/unbelievable-run-70b-llm-inference-on-a-single-4gb-gpu-with-this-new-technique-93e2057c7eeb" target="_blank" rel="nofollow noopener noreferrer">Unbelievable! Run 70B LLM Inference on a Single 4GB GPU with This NEW Technique</a></li>
<li><a href="https://github.com/lyogavin/Anima/tree/main/air_llm" target="_blank" rel="nofollow noopener noreferrer">GitHub lyogavin/Anima/air_llm</a></li>
</ul>
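The layer-by-layer idea described above can be illustrated with a toy example. This is only a sketch of the general technique under simplifying assumptions, not AirLLM's actual implementation: each "layer" is a tiny weight matrix saved to its own file, and the forward pass loads one file at a time, applies it, and drops it before loading the next, so only one layer's weights are held in memory at any moment. All names here (`save_layers`, `apply_layer`, `layered_forward`) are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def save_layers(layers: List[Matrix], directory: Path) -> List[Path]:
    """Write each layer's weights to its own file, mimicking a model split per layer."""
    paths = []
    for i, weights in enumerate(layers):
        path = directory / f"layer_{i}.json"
        path.write_text(json.dumps(weights))
        paths.append(path)
    return paths


def apply_layer(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector product followed by ReLU (a stand-in for a real model layer)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def layered_forward(layer_paths: List[Path], x: Vector) -> Vector:
    """Run the forward pass while holding only one layer's weights in memory."""
    for path in layer_paths:
        weights = json.loads(path.read_text())  # load this layer from storage
        x = apply_layer(weights, x)             # compute with it
        del weights                             # release it before the next load
    return x


if __name__ == "__main__":
    layers = [
        [[1.0, 0.0], [0.0, 1.0]],   # identity
        [[2.0, 0.0], [0.0, -1.0]],  # double the first value, negate the second
    ]
    with tempfile.TemporaryDirectory() as tmp:
        paths = save_layers(layers, Path(tmp))
        print(layered_forward(paths, [3.0, 4.0]))  # prints [6.0, 0.0]
```

In a scheme like this, every forward pass re-reads every layer file from storage, which is where the extra run time mentioned above comes from.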
A 13B model like the one used here should be well within its capabilities.
AirLLM also supports Apple silicon. For this it uses MLX, a NumPy-like array library for Apple silicon.
<ul>
<li>
<a href="https://ml-explore.github.io/mlx/build/html/index.html" target="_blank" rel="nofollow noopener noreferrer">MLX</a>
<ul>
<li><a href="https://github.com/ml-explore/mlx" target="_blank" rel="nofollow noopener noreferrer">GitHub</a></li>
</ul>
</li>
</ul>
With MLX the code has to be written slightly differently. The following notebook is a useful reference.
<ul>
<li><a href="https://github.com/lyogavin/Anima/blob/main/air_llm/examples/run_on_macos.ipynb" target="_blank" rel="nofollow noopener noreferrer">run_on_macos.ipynb</a></li>
</ul>
<h1 id="%E5%8B%95%E3%81%8B%E3%81%97%E3%81%A6%E3%81%BF%E3%82%8B">
<a class="header-anchor-link" href="#%E5%8B%95%E3%81%8B%E3%81%97%E3%81%A6%E3%81%BF%E3%82%8B" aria-hidden="true"></a> Trying it out</h1>
I ran everything on Python 3.11.7.
<h2 id="%E3%83%A2%E3%82%B8%E3%83%A5%E3%83%BC%E3%83%AB%E3%81%AE%E3%82%A4%E3%83%B3%E3%82%B9%E3%83%88%E3%83%BC%E3%83%AB">
<a class="header-anchor-link" href="#%E3%83%A2%E3%82%B8%E3%83%A5%E3%83%BC%E3%83%AB%E3%81%AE%E3%82%A4%E3%83%B3%E3%82%B9%E3%83%88%E3%83%BC%E3%83%AB" aria-hidden="true"></a> Installing the modules</h2>
Install the required modules.
<div class="code-block-container"><pre><code>pip install torch torchvision
pip install mlx

pip install airllm
</code></pre></div><h2 id="%E3%82%B5%E3%83%B3%E3%83%97%E3%83%AB%E3%82%B3%E3%83%BC%E3%83%89">
<a class="header-anchor-link" href="#%E3%82%B5%E3%83%B3%E3%83%97%E3%83%AB%E3%82%B3%E3%83%BC%E3%83%89" aria-hidden="true"></a> Sample code</h2>
<div class="code-block-container"><pre class="language-py"><code class="language-py"># elyze_airllm_mac.py

from airllm import AutoModel
import mlx.core as mx
import time

# Specify the model name
#model_name = "elyza/ELYZA-japanese-Llama-2-7b-fast-instruct"
model_name = "elyza/ELYZA-japanese-Llama-2-13b-fast-instruct"

MAX_LENGTH = 128
MAX_NEW_TOKENS = 20

# Prepare the model
model = AutoModel.from_pretrained(model_name)


# -- Input text --
# input_text = [
# '富士山の高さは？'
# ]
input_text = '富士山の高さは？'  # "How tall is Mt. Fuji?"

# -- Tokenize --
def tokenize(model, text):
    input_ids = model.tokenizer(text,
        # return_tensors="pt", # NG (did not work here)
        return_tensors="np", # OK (NumPy arrays)
        return_attention_mask=False,
        truncation=True,
        max_length=MAX_LENGTH,
        #padding=False
    )
    return input_ids

# -- Generate --
def generate(model, input_ids):
    generation_output = model.generate(
        mx.array(input_ids['input_ids']),  # convert the NumPy token IDs to an MLX array
        max_new_tokens=MAX_NEW_TOKENS,
        use_cache=True,
        return_dict_in_generate=True
    )
    return generation_output

# -- Run inference --
def query(model, text):
    input_ids = tokenize(model, text)
    generation_output = generate(model, input_ids)
    return generation_output

# --- main ---
# Note: time.process_time() reports CPU time of this process, not wall-clock time
start = time.process_time()
output = query(model, input_text)
end = time.process_time()
print('----------------')
print(output)
print('--- ', end - start, ' sec ---')

</code></pre></div><h2 id="%E5%AE%9F%E8%A1%8C">
<a class="header-anchor-link" href="#%E5%AE%9F%E8%A1%8C" aria-hidden="true"></a> Running it</h2>
On the first run the model is downloaded. In my environment this took several tens of minutes.
<ul>
<li>Download location: ~/.cache/huggingface/hub/(model name)
</li>
</ul>
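To check where the weights ended up and how much disk space they take, the cache folder can be derived from the model name. This is a small sketch that assumes the default Hugging Face Hub cache layout, in which a repository `org/name` is stored in a folder named `models--org--name`, and that the `HF_HUB_CACHE` and `HF_HOME` environment variables override the default location. `hub_cache_root`, `cache_dir_for`, and `dir_size_gb` are hypothetical helper names.

```python
import os
from pathlib import Path
from typing import Optional


def hub_cache_root() -> Path:
    """Return the Hub cache root, honouring HF_HUB_CACHE / HF_HOME if set (assumed overrides)."""
    if os.environ.get("HF_HUB_CACHE"):
        return Path(os.environ["HF_HUB_CACHE"]).expanduser()
    hf_home = os.environ.get("HF_HOME", "~/.cache/huggingface")
    return Path(hf_home).expanduser() / "hub"


def cache_dir_for(model_name: str, root: Optional[Path] = None) -> Path:
    """Map 'org/name' to '<root>/models--org--name' (assumed default cache layout)."""
    root = hub_cache_root() if root is None else root
    return root / ("models--" + model_name.replace("/", "--"))


def dir_size_gb(path: Path) -> float:
    """Total size of regular files under path in GB (0.0 if the path does not exist)."""
    if not path.exists():
        return 0.0
    # Skip symlinks so files that are linked from several places are counted once.
    total = sum(
        f.stat().st_size
        for f in path.rglob("*")
        if f.is_file() and not f.is_symlink()
    )
    return total / 1e9


if __name__ == "__main__":
    target = cache_dir_for("elyza/ELYZA-japanese-Llama-2-13b-fast-instruct")
    print(target)
    print(f"{dir_size_gb(target):.1f} GB")
```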
The result was as follows.
<div class="code-block-container"><pre><code>- 富士山の高さは3776.12 mです。&lt;/s&gt;&lt;s&gt;
--- 287.62618599999996 sec ---
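</code></pre></div>Some extra tokens were generated around the answer, but the model did run. With up to 20 new tokens taking just under 5 minutes by the script's measurement (roughly 14 seconds per token), it is too slow for practical use.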
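The stray markers at the end of the output are Llama 2's end-of-sequence and beginning-of-sequence tokens. The following is a minimal sketch of tidying the result and expressing the timing per token. It assumes the generated output is available as a plain string like the one printed above. `clean_output` and `seconds_per_token` are hypothetical helpers, not part of AirLLM.

```python
# Llama 2 special tokens: "<s>" (beginning of sequence) and "</s>" (end of sequence).
SPECIAL_TOKENS = ("<s>", "</s>")


def clean_output(text: str) -> str:
    """Cut the text at the first end-of-sequence token and drop leftover markers."""
    # Anything after the first "</s>" belongs to a new sequence, so discard it.
    text = text.split("</s>", 1)[0]
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    return text.strip()


def seconds_per_token(elapsed_sec: float, n_tokens: int) -> float:
    """Average time per token, assuming n_tokens tokens were actually generated."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return elapsed_sec / n_tokens


if __name__ == "__main__":
    raw = "富士山の高さは3776.12 mです。</s><s>"
    print(clean_output(raw))
    print(round(seconds_per_token(287.62618599999996, 20), 1))  # prints 14.4
```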
<h1 id="%E7%B5%82%E3%82%8F%E3%82%8A%E3%81%AB">
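<a class="header-anchor-link" href="#%E7%B5%82%E3%82%8F%E3%82%8A%E3%81%AB" aria-hidden="true"></a> Conclusion</h1>
Thanks to AirLLM, I was able to run a high-performance LLM on my own M1 Mac. The execution time still leaves a lot to be desired, but I hope that new techniques and libraries will keep appearing and that the day will come when such models can be used comfortably on an ordinary local machine.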
<h2 id="%E9%96%A2%E9%80%A3%E8%A8%98%E4%BA%8B">
<a class="header-anchor-link" href="#%E9%96%A2%E9%80%A3%E8%A8%98%E4%BA%8B" aria-hidden="true"></a> Related articles</h2>
<ul>
<li><a href="https://zenn.dev/mganeko/articles/llamacpp-elyza-mac" target="_blank">Running Elyza-13b on an M1 Mac with llama.cpp</a></li>
</ul>
