
Reading the AutoTokenizer.from_pretrained Code

Published 2024/02/17

Overview

https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer.from_pretrained

This article walks through the code of AutoTokenizer.from_pretrained.

Example run

We assume the case of using the Swallow-7b model.

from transformers import AutoTokenizer

model_name = "tokyotech-llm/Swallow-7b-instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

Key methods

AutoTokenizer.from_pretrained

We start here:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L631-L633

tokenizer_config

Fetching the tokenizer_config:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L759

The following file is loaded:
https://huggingface.co/tokyotech-llm/Swallow-7b-instruct-hf/blob/main/tokenizer_config.json

Loading config_tokenizer_class

The tokenizer_class is obtained from the following config entry.

"tokenizer_class": "LlamaTokenizer",

https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L803-L815

When use_fast is True, the lookup appends the Fast suffix to the class name:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L806
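The suffix handling can be sketched as follows. This is a simplified illustration: resolve_tokenizer_class_name is a made-up name, and the real code also falls back to the non-Fast name when the Fast variant cannot be found.

```python
def resolve_tokenizer_class_name(config_tokenizer_class: str, use_fast: bool) -> str:
    """Simplified sketch of how the class name to look up is chosen.

    When use_fast is True and the name from tokenizer_config.json does not
    already end in "Fast", the "Fast" suffix is appended before the lookup.
    """
    if use_fast and not config_tokenizer_class.endswith("Fast"):
        return config_tokenizer_class + "Fast"
    return config_tokenizer_class

# "LlamaTokenizer" from tokenizer_config.json is looked up as "LlamaTokenizerFast"
print(resolve_tokenizer_class_name("LlamaTokenizer", use_fast=True))
```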

The class is then instantiated:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L815

tokenizer_class_from_name

The logic that looks up the tokenizer class from the class_name string:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L478

The class is searched for in this MAP:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L60

Scanning from the top of the MAP, this entry hits first,
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L194

but the class cannot be found there, as shown below, so the search moves on to the next entry (why the idefics entry lists a class its module does not define is unclear):

AttributeError: module transformers.models.idefics has no attribute LlamaTokenizerFast

In the end, the following entry hits:
'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/auto/tokenization_auto.py#L214
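The search loop above can be sketched with toy modules. Everything here (MAPPING, MODULES, the SimpleNamespace stand-ins) is made up for illustration; the real code walks TOKENIZER_MAPPING_NAMES and imports each candidate module with importlib, but the AttributeError-then-continue behavior is the same.

```python
import types

# Toy stand-ins for transformers.models.idefics / transformers.models.llama.
# Only the llama module actually defines LlamaTokenizerFast.
idefics = types.SimpleNamespace()
llama = types.SimpleNamespace(LlamaTokenizerFast=type("LlamaTokenizerFast", (), {}))

# Simplified mapping: module name -> tokenizer class names registered for it
MAPPING = {
    "idefics": ("LlamaTokenizerFast",),
    "llama": ("LlamaTokenizer", "LlamaTokenizerFast"),
}
MODULES = {"idefics": idefics, "llama": llama}

def tokenizer_class_from_name(class_name):
    # Walk entries in order; an entry whose module lacks the attribute
    # raises AttributeError, and the search simply moves on.
    for module_name, class_names in MAPPING.items():
        if class_name in class_names:
            try:
                return getattr(MODULES[module_name], class_name)
            except AttributeError:
                continue  # e.g. idefics lists the name but does not define it
    return None

cls = tokenizer_class_from_name("LlamaTokenizerFast")
print(cls.__name__)  # found via the llama entry after idefics fails
```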

LlamaTokenizerFast.from_pretrained

The tokenizer class itself is here:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/llama/tokenization_llama_fast.py#L57

What actually gets called is the from_pretrained defined in the parent class, PreTrainedTokenizerBase:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/tokenization_utils_base.py#L1803-L1814

Loading fast_tokenizer_file

https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/tokenization_utils_base.py#L1970-L1975

resolved_config_file is the path to tokenizer_config.json.

Since there is no fast_tokenizer_files setting, tokenizer_file is left as-is rather than overwritten.

resolved_vocab_files

{'added_tokens_file': None,
 'special_tokens_map_file': '/root/.cache/huggingface/hub/models--tokyotech-llm--Swallow-7b-instruct-hf/snapshots/eab84acc130b253061ae7ffb88f254c1d496fcfd/special_tokens_map.json',
 'tokenizer_config_file': '/root/.cache/huggingface/hub/models--tokyotech-llm--Swallow-7b-instruct-hf/snapshots/eab84acc130b253061ae7ffb88f254c1d496fcfd/tokenizer_config.json',
 'tokenizer_file': None,
 'vocab_file': '/root/.cache/huggingface/hub/models--tokyotech-llm--Swallow-7b-instruct-hf/snapshots/eab84acc130b253061ae7ffb88f254c1d496fcfd/tokenizer.model'}
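The resolved paths above follow the Hub cache layout, models--{org}--{repo}/snapshots/{commit}/{filename}. A small helper reproducing just that layout (hub_cache_path is a made-up name; the real resolution goes through cached_file and may download the file):

```python
from pathlib import PurePosixPath

def hub_cache_path(cache_dir: str, repo_id: str, revision: str, filename: str) -> str:
    """Build the path a cached Hub file resolves to (layout only; no download)."""
    org, name = repo_id.split("/")
    return str(
        PurePosixPath(cache_dir) / f"models--{org}--{name}" / "snapshots" / revision / filename
    )

print(hub_cache_path(
    "/root/.cache/huggingface/hub",
    "tokyotech-llm/Swallow-7b-instruct-hf",
    "eab84acc130b253061ae7ffb88f254c1d496fcfd",
    "tokenizer.model",
))
```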

Calling _from_pretrained

https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/tokenization_utils_base.py#L2031-L2042

PreTrainedTokenizerBase._from_pretrained

https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/tokenization_utils_base.py#L2044-L2057

Instantiating the tokenizer

The tokenizer is finally instantiated here:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/tokenization_utils_base.py#L2263

init_inputs

()

init_kwargs

{'__slow_tokenizer': LlamaTokenizer(name_or_path='tokyotech-llm/Swallow-7b-instruct-hf', vocab_size=43176, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
},
 'add_bos_token': True,
 'add_eos_token': False,
 'added_tokens_decoder': {},
 'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
 'clean_up_tokenization_spaces': False,
 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
 'legacy': False,
 'model_max_length': 1000000000000000019884624838656,
 'name_or_path': 'tokyotech-llm/Swallow-7b-instruct-hf',
 'pad_token': None,
 'padding_side': 'right',
 'sp_model_kwargs': {},
 'tokenizer_file': None,
 'unk_token': AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
 'vocab_file': '/root/.cache/huggingface/hub/models--tokyotech-llm--Swallow-7b-instruct-hf/snapshots/eab84acc130b253061ae7ffb88f254c1d496fcfd/tokenizer.model'}

LlamaTokenizerFast.__init__

https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/llama/tokenization_llama_fast.py#L111-L123

This calls PreTrainedTokenizerFast's __init__:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/models/llama/tokenization_llama_fast.py#L124-L135

PreTrainedTokenizerFast.__init__

https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/tokenization_utils_fast.py#L94

Here the conversion from the slow tokenizer to the fast tokenizer is invoked.

https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/tokenization_utils_fast.py#L112-L114
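The decision order inside PreTrainedTokenizerFast.__init__ can be sketched roughly as below. This is an assumption-laden simplification (pick_fast_tokenizer_source is a made-up name, and the real method has additional branches), meant only to show why the conversion path is taken for Swallow-7b.

```python
def pick_fast_tokenizer_source(tokenizer_object=None, fast_tokenizer_file=None,
                               slow_tokenizer=None, from_slow=False):
    """Rough sketch of the branch order in PreTrainedTokenizerFast.__init__
    (simplified; not the exact transformers implementation)."""
    if tokenizer_object is not None:
        return "use the given tokenizer_object"
    if fast_tokenizer_file is not None and not from_slow:
        return "load the serialized tokenizer.json"
    if slow_tokenizer is not None:
        return "convert_slow_tokenizer(slow_tokenizer)"
    raise ValueError("cannot build a fast tokenizer from nothing")

# Swallow-7b: tokenizer_file is None but a slow LlamaTokenizer was built,
# so the conversion branch is taken.
print(pick_fast_tokenizer_source(slow_tokenizer=object()))
```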

convert_slow_tokenizer

The conversion to a fast tokenizer:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/convert_slow_tokenizer.py#L1359

This MAP dispatches the conversion to LlamaConverter:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/convert_slow_tokenizer.py#L1354
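The dispatch through that MAP can be sketched with toy classes (FakeLlamaConverter and SLOW_TO_FAST are made up for this illustration; the real mapping is SLOW_TO_FAST_CONVERTERS and the converter returns a tokenizers.Tokenizer):

```python
class FakeLlamaConverter:
    """Stand-in for LlamaConverter; the real one builds a tokenizers.Tokenizer."""
    def __init__(self, original_tokenizer):
        self.original_tokenizer = original_tokenizer

    def converted(self):
        return f"fast tokenizer converted from {type(self.original_tokenizer).__name__}"

# Simplified mapping: slow tokenizer class name -> converter class
SLOW_TO_FAST = {"LlamaTokenizer": FakeLlamaConverter}

class LlamaTokenizer:  # toy slow tokenizer
    pass

def convert_slow_tokenizer(slow_tokenizer):
    # Look up the converter by the slow tokenizer's class name, then convert.
    converter_cls = SLOW_TO_FAST[type(slow_tokenizer).__name__]
    return converter_cls(slow_tokenizer).converted()

print(convert_slow_tokenizer(LlamaTokenizer()))
```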

LlamaConverter

The conversion logic:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/convert_slow_tokenizer.py#L1190

SpmConverter.__init__

The base class of LlamaConverter:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/convert_slow_tokenizer.py#L500

Reading vocab_file

https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/convert_slow_tokenizer.py#L508-L511

self.original_tokenizer.vocab_file holds the file path of tokenizer.model.

SpmConverter.converted

Returns the result of converting to a fast tokenizer:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/convert_slow_tokenizer.py#L578

LlamaConverter.tokenizer

https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/convert_slow_tokenizer.py#L1227-L1239

model_type

This case is handled in the model_type == 2 (BPE) branch:
https://github.com/huggingface/transformers/blob/3de6a6b4936229e3b4467dd7de1c24f2fae64528/src/transformers/convert_slow_tokenizer.py#L1227-L1239

This constructs the tokenizer with huggingface/tokenizers:

tokenizer = Tokenizer(
    BPE(bpe_vocab, merges, unk_token=proto.trainer_spec.unk_piece, fuse_unk=True, byte_fallback=True)
)

https://github.com/huggingface/tokenizers/tree/main
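To build that BPE model, the converter needs a merges list, which is recovered from the SentencePiece vocabulary itself: a pair (left, right) is a merge candidate when left, right, and left+right are all vocabulary entries. A simplified sketch of that recovery (extract_merges is a made-up name; the real extractor also deals with scores and edge cases):

```python
def extract_merges(vocab):
    """Recover BPE merges from a rank-ordered vocab (token -> rank).

    Simplified: every split of a piece whose halves are both in the vocab
    is a merge; merges are ordered by the rank of the merged token, so
    earlier vocab entries merge first.
    """
    merges = []
    for piece in vocab:
        for i in range(1, len(piece)):
            left, right = piece[:i], piece[i:]
            if left in vocab and right in vocab:
                merges.append((left, right))
    merges.sort(key=lambda lr: vocab[lr[0] + lr[1]])
    return merges

# "lo" can be formed from l+o, and "low" from lo+w
vocab = {"l": 0, "o": 1, "w": 2, "lo": 3, "low": 4}
print(extract_merges(vocab))  # [('l', 'o'), ('lo', 'w')]
```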
