Closed on 2025/01/05

A belated introduction to "Transformers" (3) TUTORIALS: Load pretrained instances with an AutoClass

Python

transformers

Hugging Face

kun432

Continued from the previous scrap in this series.

Tutorial: Load pretrained instances with an AutoClass

There are many Transformer architectures. An AutoClass infers the architecture automatically from the checkpoint, so the model can be loaded and used with from_pretrained().

Architecture
- The structure of the model
- Example: BERT
Checkpoint
- The weights and parameters trained for a particular architecture
- Example: google-bert/bert-base-uncased

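This has the following benefits:

- You can write simple code that does not depend on the architecture.
- A model can be loaded right away with from_pretrained().
- If the tasks are similar, the same code works even when the architectures differ.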
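The idea of inferring the architecture from a checkpoint can be pictured with a small toy sketch. It assumes only that a checkpoint ships a configuration naming its model type (Hugging Face checkpoints carry a `model_type` field in `config.json`). The classes `ToyBertModel`, `ToyGpt2Model`, `ToyAutoModel` and the registry are hypothetical stand-ins written for this note, not the actual Transformers implementation.

```python
import json


class ToyBertModel:
    """Hypothetical stand-in for a BERT-style model class."""

    def __init__(self, config: dict):
        self.config = config


class ToyGpt2Model:
    """Hypothetical stand-in for a GPT-2-style model class."""

    def __init__(self, config: dict):
        self.config = config


# Toy registry: model type name -> concrete class (an assumption for this sketch).
TOY_MODEL_MAPPING = {"bert": ToyBertModel, "gpt2": ToyGpt2Model}


class ToyAutoModel:
    """Picks a concrete class from the model type recorded in the config."""

    @classmethod
    def from_config_json(cls, config_json: str):
        config = json.loads(config_json)
        model_type = config["model_type"]
        if model_type not in TOY_MODEL_MAPPING:
            raise ValueError(f"Unknown model type: {model_type}")
        return TOY_MODEL_MAPPING[model_type](config)


# The calling code is identical for both architectures.
bert_like = ToyAutoModel.from_config_json('{"model_type": "bert", "hidden_size": 768}')
gpt2_like = ToyAutoModel.from_config_json('{"model_type": "gpt2", "n_embd": 768}')
print(type(bert_like).__name__, type(gpt2_like).__name__)
```

The caller never names a concrete class; only the configuration decides which one is built, which is the property that makes architecture-independent code possible.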

The following classes are used:

- AutoTokenizer: loads a pretrained tokenizer.
- AutoImageProcessor: loads a pretrained image processor.
- AutoFeatureExtractor: loads a pretrained feature extractor.
- AutoProcessor: loads a pretrained processor.
- AutoModel: loads a pretrained model.

`AutoTokenizer`

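Tokenizers are fundamental to NLP tasks. They convert the input into a format the model can process.

The model used is google/gemma-2-2b-jpn-it.

Load the tokenizer.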

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-jpn-it")

Tokenize a string.

sequence = "地面の穴の中にホビットが住んでいました。"
print(tokenizer(sequence))

Output

{
    'input_ids': [2, 81166, 235372, 238442, 81731, 236157, 151326, 235425, 236228, 13359, 13383, 235362],
    'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}
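The `attention_mask` above is all ones because a single sequence needs no padding. As a rough illustration of what the mask is for, here is a toy padding helper. `pad_batch`, the pad ID of 0, and the choice of right-padding are assumptions made for this sketch; they are not the tokenizer's real code or Gemma's actual settings.

```python
def pad_batch(batch_ids: list[list[int]], pad_id: int = 0) -> dict[str, list[list[int]]]:
    """Right-pad every sequence to the longest one and mark real tokens with 1."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Arbitrary token IDs, used only to show the shape of the result.
padded = pad_batch([[2, 81166, 235362], [2, 13359]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

Positions marked 0 are padding that the model is meant to ignore.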

`AutoImageProcessor`

In vision tasks, an image processor converts images into the model's input format.

The model used is google/vit-base-patch16-224.

from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

This image processor accepts a PIL image as input, so the conversion can be done as follows. The image is the same one used earlier in this series.

from PIL import Image
import requests
import io

image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(io.BytesIO(requests.get(image_url).content))

image_processor(image)

Output

{'pixel_values': [array([[[-0.05882353, -0.10588235, -0.21568626, ...,  0.19215691,
          0.1686275 ,  0.09803927],
        [-0.02745098, -0.0745098 , -0.17647058, ...,  0.2313726 ,
          0.20000005,  0.12941182],
        [-0.01176471, -0.06666666, -0.17647058, ...,  0.26274514,
          0.22352946,  0.16078436],
        ...,
        [ 0.79607844,  0.79607844,  0.8039216 , ..., -0.8666667 ,
         -0.8666667 , -0.8352941 ],
        [ 0.8039216 ,  0.8039216 ,  0.8039216 , ..., -0.7647059 ,
         -0.8117647 , -0.8117647 ],
        [ 0.8117647 ,  0.81960785,  0.8117647 , ..., -0.7176471 ,
         -0.79607844, -0.8117647 ]],

       [[-0.03529412, -0.08235294, -0.19215685, ...,  0.15294123,
          0.12941182,  0.05882359],
        [-0.00392157, -0.05098039, -0.15294117, ...,  0.19215691,
          0.16078436,  0.09019613],
        [ 0.01176476, -0.04313725, -0.15294117, ...,  0.22352946,
          0.18431377,  0.12156868],
        ...,
        [ 0.84313726,  0.84313726,  0.8509804 , ..., -0.90588236,
         -0.90588236, -0.8745098 ],
        [ 0.8509804 ,  0.8509804 ,  0.8509804 , ..., -0.8039216 ,
         -0.8509804 , -0.8509804 ],
        [ 0.85882354,  0.8666667 ,  0.85882354, ..., -0.75686276,
         -0.8352941 , -0.8509804 ]],

       [[ 0.03529418, -0.01176471, -0.12156862, ...,  0.12156868,
          0.09803927,  0.02745104],
        [ 0.06666672,  0.0196079 , -0.08235294, ...,  0.16078436,
          0.12941182,  0.05882359],
        [ 0.082353  ,  0.02745104, -0.08235294, ...,  0.19215691,
          0.15294123,  0.09019613],
        ...,
        [ 0.9372549 ,  0.9372549 ,  0.94509804, ..., -0.92941177,
         -0.92941177, -0.8980392 ],
        [ 0.94509804,  0.94509804,  0.94509804, ..., -0.827451  ,
         -0.8745098 , -0.8745098 ],
        [ 0.9529412 ,  0.9607843 ,  0.9529412 , ..., -0.78039217,
         -0.85882354, -0.8745098 ]]], dtype=float32)]}
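The values in `pixel_values` fall roughly between -1 and 1. That is consistent with the common two-step preprocessing of rescaling 8-bit pixels by 1/255 and then normalizing with a per-channel mean and standard deviation. The sketch below assumes a mean and standard deviation of 0.5 for this checkpoint (an assumption that fits the numbers shown, not something stated in this scrap), and `normalize_pixel` is a hypothetical helper.

```python
def normalize_pixel(value: int, mean: float = 0.5, std: float = 0.5) -> float:
    """Rescale an 8-bit pixel to [0, 1], then normalize it as (x - mean) / std."""
    rescaled = value / 255.0
    return (rescaled - mean) / std


# Darkest, mid-range, and brightest 8-bit values.
for pixel in (0, 120, 255):
    print(pixel, round(normalize_pixel(pixel), 8))
```

Under these assumed parameters, a pixel value of 120 maps to about -0.05882353, which is the first value in the output above.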

`AutoFeatureExtractor`

In audio tasks, a feature extractor converts the audio signal into the model's input format.

The model used is rinna/japanese-wav2vec2-base.

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained(
    "rinna/japanese-wav2vec2-base"
)

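This feature extractor takes the audio data read with soundfile, together with its sampling rate, and converts it into model input.

import soundfile as sf

# NOTE: `ds` is not defined in this scrap; it is assumed to be an audio dataset loaded earlier in this series.
raw_speech, sr = sf.read(ds[18]["audio"]["path"])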
feature_extractor(
    raw_speech,
    sampling_rate=sr
)

Output

{
    'input_values': array(
        [[4.1687012e-02, 3.1616211e-02, 1.4770508e-02, ..., 1.4648438e-03,
        6.1035156e-04, 6.1035156e-05]],
        dtype=float32
    )
}
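The last value above, 6.1035156e-05, equals 2/32768. That suggests (as a guess, not something confirmed in this scrap) that the source file was 16-bit PCM audio whose integer samples were scaled to floats in [-1, 1) when read. A minimal sketch of that scaling follows; `pcm16_to_float` is a hypothetical helper, and the integer samples are chosen for illustration.

```python
def pcm16_to_float(samples: list[int]) -> list[float]:
    """Scale signed 16-bit PCM samples to floats in [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]


# Integer sample values chosen for illustration.
print(pcm16_to_float([1366, 48, 2]))
```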

`AutoProcessor`

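Multimodal tasks need two kinds of preprocessing (for example, a tokenizer for text and an image processor for images). AutoProcessor bundles them together.

The model used is Qwen/Qwen2-VL-2B-Instruct.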

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct"
)
processor

Output

Qwen2VLProcessor:
- image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "max_pixels": 12845056,
    "min_pixels": 3136
  },
  "temporal_patch_size": 2
}

- tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2-VL-2B-Instruct', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

{
  "processor_class": "Qwen2VLProcessor"
}

You can see that it consists of an image_processor that handles images and a tokenizer that handles text.
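As a loose picture of that composition, the sketch below bundles two preprocessors behind a single call. `ToyProcessor`, `toy_tokenizer`, and `toy_image_processor` are hypothetical names invented here; the real Qwen2VLProcessor does considerably more than merge two dictionaries.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToyProcessor:
    """Bundles a text preprocessor and an image preprocessor behind one call."""

    tokenizer: Callable[[str], dict[str, Any]]
    image_processor: Callable[[Any], dict[str, Any]]

    def __call__(self, text: str, image: Any) -> dict[str, Any]:
        # Merge the outputs of both preprocessors into one dict of model inputs.
        return {**self.tokenizer(text), **self.image_processor(image)}


def toy_tokenizer(text: str) -> dict[str, Any]:
    # Stand-in: one fake ID per whitespace-separated word.
    ids = list(range(len(text.split())))
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def toy_image_processor(image: Any) -> dict[str, Any]:
    # Stand-in: rescale 8-bit pixel values to [0, 1].
    return {"pixel_values": [[p / 255.0 for p in row] for row in image]}


processor_sketch = ToyProcessor(toy_tokenizer, toy_image_processor)
inputs = processor_sketch("describe this image", [[0, 255], [51, 102]])
print(sorted(inputs))
```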

`AutoModel`

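The AutoModelFor classes load a model for a specific task. The available tasks are listed in the Transformers documentation.

As an example, take christian-phu/bert-finetuned-japanese-sentiment.

To load a model for sentiment analysis of sentences, use AutoModelForSequenceClassification.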

from transformers import AutoModelForSequenceClassification

model_name = "christian-phu/bert-finetuned-japanese-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)

Now run the task.

from transformers import AutoTokenizer
from torch import nn

tokenizer = AutoTokenizer.from_pretrained(model_name)

batch = tokenizer(
    [
        "私たちは🤗Transformersライブラリをお見せできてとても嬉しいです。",
        "Transformers、覚えないといけないことが多いし難しすぎて泣いています😭"
    ],
    padding=True,
    max_length=512,
    truncation=True,
    return_tensors="pt",
)

outputs = model(**batch)

predictions = nn.functional.softmax(outputs.logits, dim=-1)
predictions

For each sentence, a probability is returned for each label.

Output

tensor([[1.1796e-03, 8.3060e-04, 9.9799e-01],
        [5.8820e-01, 3.7683e-01, 3.4977e-02]], grad_fn=<SoftmaxBackward0>)
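Each row sums to 1 because softmax turns the model's raw scores (logits) into a probability distribution over the labels. A minimal pure-Python sketch of the same computation, using made-up logits rather than the model's real ones:

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one sentence and three labels.
probs = softmax([-2.0, -2.5, 4.5])
print([round(p, 4) for p in probs], round(sum(probs), 4))
```

The largest logit receives the largest probability, so taking the argmax of the probabilities picks the same label as taking the argmax of the logits.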

Formatted, it looks like this:

for pred in predictions:
    print("Result:")
    for label_id, score in enumerate(pred):
        label = model.config.id2label[label_id]
        print(f"  - {label}: {score.item():.4f}")
    print("-" * 30)

Output

Result:
  - negative: 0.0012
  - neutral: 0.0008
  - positive: 0.9980
------------------------------
Result:
  - negative: 0.5882
  - neutral: 0.3768
  - positive: 0.0350
------------------------------

Print only the label with the highest probability.

for pred in predictions:
    print("Result:")
    max_idx = pred.argmax().item()          # index of the highest probability
    label = model.config.id2label[max_idx]  # label for that index
    score = pred[max_idx].item()            # the highest probability itself
    print(f"- {label}: {score:.4f}")

Output

Result:
- positive: 0.9980
Result:
- negative: 0.5882

As a different task, try loading the same checkpoint with AutoModelForTokenClassification, which performs token-level classification.

from transformers import AutoModelForTokenClassification

model_name = "christian-phu/bert-finetuned-japanese-sentiment"
model = AutoModelForTokenClassification.from_pretrained(model_name)

Process it the same way.

from torch import nn
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

batch = tokenizer(
    [
        "私たちは🤗Transformersライブラリをお見せできてとても嬉しいです。",
        "Transformers、覚えないといけないことが多いし難しすぎて泣いています😭"
    ],
    padding=True,
    max_length=512,
    truncation=True,
    return_tensors="pt",
)

outputs = model(**batch)

predictions = nn.functional.softmax(outputs.logits, dim=-1)
predictions

Output

tensor([[[0.3924, 0.1890, 0.4186],
         [0.2241, 0.1771, 0.5989],
         [0.2949, 0.2189, 0.4862],
         [0.2863, 0.2604, 0.4533],
         [0.5581, 0.1443, 0.2975],
         [0.3131, 0.1877, 0.4992],
         [0.2690, 0.1666, 0.5644],
         [0.2658, 0.3378, 0.3964],
         [0.4999, 0.2145, 0.2857],
         [0.4075, 0.2503, 0.3422],
         [0.3456, 0.3259, 0.3286],
         [0.2988, 0.3082, 0.3930],
         [0.1673, 0.3859, 0.4468],
         [0.1827, 0.3970, 0.4204],
         [0.1820, 0.3809, 0.4371],
         [0.1766, 0.3636, 0.4598],
         [0.1501, 0.3488, 0.5011],
         [0.1430, 0.2010, 0.6560],
         [0.1613, 0.1852, 0.6535],
         [0.2068, 0.1742, 0.6190],
         [0.5156, 0.1204, 0.3640],
         [0.4314, 0.1555, 0.4131],
         [0.3847, 0.2068, 0.4085],
         [0.3545, 0.2313, 0.4142]],

        [[0.3745, 0.2616, 0.3639],
         [0.2159, 0.2800, 0.5041],
         [0.2261, 0.1882, 0.5857],
         [0.1607, 0.4639, 0.3754],
         [0.1659, 0.3516, 0.4825],
         [0.2429, 0.2544, 0.5027],
         [0.1758, 0.2804, 0.5438],
         [0.2723, 0.2899, 0.4378],
         [0.4572, 0.2564, 0.2864],
         [0.2529, 0.3033, 0.4439],
         [0.2822, 0.3117, 0.4062],
         [0.2478, 0.3497, 0.4025],
         [0.3709, 0.3147, 0.3144],
         [0.1435, 0.2563, 0.6002],
         [0.2913, 0.3581, 0.3506],
         [0.2262, 0.4104, 0.3634],
         [0.1470, 0.4694, 0.3836],
         [0.1868, 0.4808, 0.3324],
         [0.4322, 0.2330, 0.3347],
         [0.1726, 0.2556, 0.5718],
         [0.2644, 0.4158, 0.3199],
         [0.2390, 0.4206, 0.3404],
         [0.3937, 0.2586, 0.3477],
         [0.1755, 0.2328, 0.5917]]], grad_fn=<SoftmaxBackward0>)

Looking at the output alone, it does somewhat look like sentiment analysis results are being produced per token.
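The visible difference between the two task classes is the shape of the result: sequence classification gives one row of label probabilities per sentence (2 x 3 above), while token classification gives one row per token (2 x 24 x 3 above). The sketch below checks this on a truncated copy of the numbers shown; `nested_shape` is a hypothetical helper for plain nested lists, not a library function.

```python
def nested_shape(data) -> tuple[int, ...]:
    """Return the shape of a regularly nested, non-empty list, like tensor.shape."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0]
    return tuple(shape)


# Sequence classification: one row of label probabilities per sentence.
sequence_level = [[0.0012, 0.0008, 0.9980], [0.5882, 0.3768, 0.0350]]

# Token classification: one row per token (only the first two tokens of each sentence).
token_level = [
    [[0.3924, 0.1890, 0.4186], [0.2241, 0.1771, 0.5989]],
    [[0.3745, 0.2616, 0.3639], [0.2159, 0.2800, 0.5041]],
]

print(nested_shape(sequence_level))  # (sentences, labels)
print(nested_shape(token_level))     # (sentences, tokens, labels)
```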

Let's print it out to check.

for i, sentence in enumerate(predictions):  # results for each sentence
    print(f"Sentence {i+1}:")
    for j, token_scores in enumerate(sentence):  # scores for each token
        token_id = batch["input_ids"][i][j].item()  # convert the token ID to an int
        token = tokenizer.convert_ids_to_tokens(token_id)  # look up the token string
        max_idx = token_scores.argmax().item()  # index of the highest score
        label = model.config.id2label[max_idx]  # map the index to its label
        score = token_scores[max_idx].item()  # value of the highest score
        print(f"  Token: {token}, label: {label}, probability: {score:.4f}")
    print("-" * 30)

Output

Sentence 1:
  Token: [CLS], label: positive, probability: 0.4186
  Token: 私, label: positive, probability: 0.5989
  Token: たち, label: positive, probability: 0.4862
  Token: は, label: positive, probability: 0.4533
  Token: [UNK], label: negative, probability: 0.5581
  Token: Trans, label: positive, probability: 0.4992
  Token: ##form, label: positive, probability: 0.5644
  Token: ##ers, label: positive, probability: 0.3964
  Token: ライブラリ, label: negative, probability: 0.4999
  Token: を, label: negative, probability: 0.4075
  Token: お, label: negative, probability: 0.3456
  Token: 見せ, label: positive, probability: 0.3930
  Token: でき, label: positive, probability: 0.4468
  Token: て, label: positive, probability: 0.4204
  Token: とても, label: positive, probability: 0.4371
  Token: 嬉, label: positive, probability: 0.4598
  Token: ##しい, label: positive, probability: 0.5011
  Token: です, label: positive, probability: 0.6560
  Token: 。, label: positive, probability: 0.6535
  Token: [SEP], label: positive, probability: 0.6190
  Token: [PAD], label: negative, probability: 0.5156
  Token: [PAD], label: negative, probability: 0.4314
  Token: [PAD], label: positive, probability: 0.4085
  Token: [PAD], label: positive, probability: 0.4142
------------------------------
Sentence 2:
  Token: [CLS], label: negative, probability: 0.3745
  Token: Trans, label: positive, probability: 0.5041
  Token: ##form, label: positive, probability: 0.5857
  Token: ##ers, label: neutral, probability: 0.4639
  Token: 、, label: positive, probability: 0.4825
  Token: 覚え, label: positive, probability: 0.5027
  Token: ない, label: positive, probability: 0.5438
  Token: と, label: positive, probability: 0.4378
  Token: いけ, label: negative, probability: 0.4572
  Token: ない, label: positive, probability: 0.4439
  Token: こと, label: positive, probability: 0.4062
  Token: が, label: positive, probability: 0.4025
  Token: 多い, label: negative, probability: 0.3709
  Token: し, label: positive, probability: 0.6002
  Token: 難, label: neutral, probability: 0.3581
  Token: ##し, label: neutral, probability: 0.4104
  Token: すぎ, label: neutral, probability: 0.4694
  Token: て, label: neutral, probability: 0.4808
  Token: 泣い, label: negative, probability: 0.4322
  Token: て, label: positive, probability: 0.5718
  Token: い, label: neutral, probability: 0.4158
  Token: ます, label: neutral, probability: 0.4206
  Token: [UNK], label: negative, probability: 0.3937
  Token: [SEP], label: positive, probability: 0.5917
------------------------------
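The listing for the first sentence also includes predictions for the [PAD] positions, which carry no meaning. One way to skip them, sketched here in plain Python, is to filter on the attention mask. `drop_padding` is a hypothetical helper, and the mask values are what `padding=True` would be expected to produce for these tokens, not values copied from the run.

```python
def drop_padding(tokens: list[str], attention_mask: list[int]) -> list[str]:
    """Keep only the tokens whose attention mask is 1 (i.e. not padding)."""
    return [tok for tok, keep in zip(tokens, attention_mask) if keep == 1]


# Tail of the first sentence from the listing; the mask values are assumed.
tokens = ["です", "。", "[SEP]", "[PAD]", "[PAD]", "[PAD]", "[PAD]"]
attention_mask = [1, 1, 1, 0, 0, 0, 0]
print(drop_padding(tokens, attention_mask))
```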

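Judging from the results, I don't think they are correct, which is not surprising for a checkpoint that was fine-tuned for sentence-level sentiment classification. Still, it at least shows that the same checkpoint can be loaded and run under a different task class.

Whether this actually works in practice presumably depends on how the model is built, but my understanding is that this is how Transformers is designed.

Incidentally, for LLM text generation, AutoModelForCausalLM seems to be the class seen most often.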

This scrap was closed on 2025/01/05.