Overview
https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer.from_pretrained
This note reads through the code of AutoTokenizer.from_pretrained.
Example
Assumes the Swallow-7b model is used:
from transformers import AutoTokenizer
model_name = "tokyotech-llm/Swallow-7b-instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
Key methods
AutoTokenizer.from_pretrained
Execution starts here:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L631-L633
tokenizer_config
Fetching tokenizer_config:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L759
It loads this file:
https://huggingface.co/tokyotech-llm/Swallow-7b-instruct-hf/blob/main/tokenizer_config.json
Loading config_tokenizer_class
tokenizer_class is obtained from the following config entry:
"tokenizer_class": "LlamaTokenizer",
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L803-L815
When use_fast is True, the lookup appends a Fast suffix to the class name:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L806
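The suffix logic can be sketched as follows (a hand-rolled mimic of the control flow, not the actual transformers code):

```python
# Toy sketch: when use_fast is True and the configured class name does not
# already end in "Fast", a "Fast" suffix is appended before the class lookup.
def candidate_class_name(config_tokenizer_class: str, use_fast: bool) -> str:
    if use_fast and not config_tokenizer_class.endswith("Fast"):
        return config_tokenizer_class + "Fast"
    return config_tokenizer_class

print(candidate_class_name("LlamaTokenizer", use_fast=True))   # LlamaTokenizerFast
print(candidate_class_name("LlamaTokenizer", use_fast=False))  # LlamaTokenizer
```

For Swallow-7b this is why a config that says "LlamaTokenizer" still ends up resolving to LlamaTokenizerFast.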
Instantiating the class:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L815
tokenizer_class_from_name
Searches for the tokenizer class given class_name (a string):
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L478
The class is looked up in this MAP:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L60
Scanning from the top, this entry hits first:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L194
but transformers.models.idefics does not actually define the class, so the AttributeError below is caught and the loop moves on to the next entry:
AttributeError: module transformers.models.idefics has no attribute LlamaTokenizerFast
In the end, this entry matches:
'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L214
LlamaTokenizerFast.from_pretrained
The tokenizer class is here:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/llama/tokenization_llama_fast.py#L57
The from_pretrained actually invoked is the one inherited from the base class (PreTrainedTokenizerBase):
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/tokenization_utils_base.py#L1803-L1814
Loading fast_tokenizer_file:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/tokenization_utils_base.py#L1970-L1975
resolved_config_file is the path to tokenizer_config.json. Since fast_tokenizer_files is not set there, tokenizer_file is used as-is without being overwritten.
resolved_vocab_files
{'added_tokens_file': None,
'special_tokens_map_file': '/root/.cache/huggingface/hub/models--tokyotech-llm--Swallow-7b-instruct-hf/snapshots/eab84acc130b253061ae7ffb88f254c1d496fcfd/special_tokens_map.json',
'tokenizer_config_file': '/root/.cache/huggingface/hub/models--tokyotech-llm--Swallow-7b-instruct-hf/snapshots/eab84acc130b253061ae7ffb88f254c1d496fcfd/tokenizer_config.json',
'tokenizer_file': None,
'vocab_file': '/root/.cache/huggingface/hub/models--tokyotech-llm--Swallow-7b-instruct-hf/snapshots/eab84acc130b253061ae7ffb88f254c1d496fcfd/tokenizer.model'}
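As a rough sketch of how such a dict comes about (my own illustration, not the transformers implementation): each candidate filename is resolved against the downloaded snapshot, and entries whose file does not exist stay None. Swallow-7b ships tokenizer.model but no tokenizer.json, hence tokenizer_file is None:

```python
import os
import tempfile

# candidate vocab files a tokenizer may ship (subset, for illustration)
VOCAB_FILE_CANDIDATES = {
    "vocab_file": "tokenizer.model",
    "tokenizer_file": "tokenizer.json",
    "tokenizer_config_file": "tokenizer_config.json",
}

def resolve_vocab_files(snapshot_dir):
    resolved = {}
    for key, filename in VOCAB_FILE_CANDIDATES.items():
        path = os.path.join(snapshot_dir, filename)
        resolved[key] = path if os.path.isfile(path) else None
    return resolved

with tempfile.TemporaryDirectory() as d:
    # mimic the Swallow-7b snapshot: tokenizer.model present, tokenizer.json absent
    open(os.path.join(d, "tokenizer.model"), "w").close()
    open(os.path.join(d, "tokenizer_config.json"), "w").close()
    files = resolve_vocab_files(d)
    print(files["tokenizer_file"])  # None
```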
Calling _from_pretrained:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/tokenization_utils_base.py#L2031-L2042
PreTrainedTokenizerBase._from_pretrained
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/tokenization_utils_base.py#L2044-L2057
Creating the tokenizer instance
The tokenizer is finally instantiated here:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/tokenization_utils_base.py#L2263
init_inputs is empty here; init_kwargs is:
{'__slow_tokenizer': LlamaTokenizer(name_or_path='tokyotech-llm/Swallow-7b-instruct-hf', vocab_size=43176, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False), added_tokens_decoder={
0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
},
'add_bos_token': True,
'add_eos_token': False,
'added_tokens_decoder': {},
'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
'clean_up_tokenization_spaces': False,
'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
'legacy': False,
'model_max_length': 1000000000000000019884624838656,
'name_or_path': 'tokyotech-llm/Swallow-7b-instruct-hf',
'pad_token': None,
'padding_side': 'right',
'sp_model_kwargs': {},
'tokenizer_file': None,
'unk_token': AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
'vocab_file': '/root/.cache/huggingface/hub/models--tokyotech-llm--Swallow-7b-instruct-hf/snapshots/eab84acc130b253061ae7ffb88f254c1d496fcfd/tokenizer.model'}
LlamaTokenizerFast.__init__
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/llama/tokenization_llama_fast.py#L111-L123
It calls PreTrainedTokenizerFast's __init__:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/llama/tokenization_llama_fast.py#L124-L135
PreTrainedTokenizerFast.__init__
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/tokenization_utils_fast.py#L94
The conversion from the slow tokenizer to the fast tokenizer is invoked here:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/tokenization_utils_fast.py#L112-L114
convert_slow_tokenizer
Conversion to a fast tokenizer:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/convert_slow_tokenizer.py#L1359
The conversion is performed by LlamaConverter, found in this MAP:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/convert_slow_tokenizer.py#L1354
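The dispatch pattern can be mimicked as follows (class names are stand-ins; only the shape matches the real code): the slow tokenizer's class name is looked up in a converter MAP, the matching converter is constructed with the slow tokenizer, and .converted() returns the fast backend.

```python
class FakeSlowLlamaTokenizer:
    # stand-in for a slow (SentencePiece-backed) tokenizer
    vocab = {"<unk>": 0, "hello": 1}

class LlamaConverterSketch:
    def __init__(self, original_tokenizer):
        self.original_tokenizer = original_tokenizer

    def converted(self):
        # a real converter builds a tokenizers.Tokenizer here
        return {"backend": "fast", "vocab": self.original_tokenizer.vocab}

# class-name -> converter MAP, mirroring SLOW_TO_FAST_CONVERTERS
SLOW_TO_FAST_CONVERTERS = {"FakeSlowLlamaTokenizer": LlamaConverterSketch}

def convert_slow_tokenizer(slow_tokenizer):
    name = type(slow_tokenizer).__name__
    converter_class = SLOW_TO_FAST_CONVERTERS[name]
    return converter_class(slow_tokenizer).converted()

fast = convert_slow_tokenizer(FakeSlowLlamaTokenizer())
print(fast["backend"])  # fast
```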
LlamaConverter
The conversion logic:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/convert_slow_tokenizer.py#L1190
SpmConverter.__init__
The base class of LlamaConverter:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/convert_slow_tokenizer.py#L500
Reading vocab_file:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/convert_slow_tokenizer.py#L508-L511
self.original_tokenizer.vocab_file holds the path to tokenizer.model.
SpmConverter.converted
Returns the result of converting to a fast tokenizer:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/convert_slow_tokenizer.py#L578
LlamaConverter.tokenizer
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/convert_slow_tokenizer.py#L1227-L1239
model_type
proto.trainer_spec.model_type == 2 corresponds to SentencePiece's BPE model, so the BPE branch is taken.
It calls into huggingface/tokenizers:
tokenizer = Tokenizer(
BPE(bpe_vocab, merges, unk_token=proto.trainer_spec.unk_piece, fuse_unk=True, byte_fallback=True)
)
https://github.com/huggingface/tokenizers/tree/main
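What the BPE model does with its merges list can be illustrated with a tiny pure-Python sketch (simplified: no unk_token, fuse_unk, or byte_fallback handling, and merge ranks come from list order):

```python
# Hand-rolled BPE segmentation: repeatedly apply the highest-priority
# (lowest-rank) merge rule found among adjacent token pairs.
def bpe_tokenize(word, merges):
    tokens = list(word)
    while True:
        # collect (rank, position) for every adjacent pair that has a merge rule
        candidates = [
            (merges.index(pair), i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in merges
        ]
        if not candidates:
            return tokens
        _, i = min(candidates)  # best-ranked merge wins
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

merges = [("l", "o"), ("lo", "w")]  # learned merge rules, best first
print(bpe_tokenize("low", merges))   # ['low']
print(bpe_tokenize("slow", merges))  # ['s', 'low']
```

The real BPE model receives bpe_vocab and merges extracted from tokenizer.model by the converter, which is how the SentencePiece vocabulary survives the slow-to-fast conversion.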
Discussion