🔖

llama.cpp の tokenizer でのユーザー定義 special_token の扱いのメモ

2024/10/18に公開

gpt2(tiktoken)では special_tokens は split 前に regex で処理している
llama.cpp では special tokens の処理には regex は使わず, regex 相当を自前処理している.
https://github.com/ggerganov/llama.cpp/blob/afd9909a6481402844aecefa8a8908afdd7f52f1/src/llama-vocab.cpp#L1437
まず tokenizer_st_partition で, special token が入力テキストにないか探索し, マッチすれば encode(str -> id), なければ fragment buffer に RAW TEXT(未処理)として追加
その後 RAW TEXT については split を行い,
https://github.com/ggerganov/llama.cpp/blob/afd9909a6481402844aecefa8a8908afdd7f52f1/src/unicode.cpp#L653
tokenize し, special tokens の結果と concat して encode する.

Discussion