Open (comment added on 2023/10/29, 5 comments)

Reading the official Hugging Face transformers documentation

transformers

Hugging Face

hassaku63

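Source: https://huggingface.co/docs/transformers/index

Reference book: https://www.oreilly.co.jp/books/9784873119953/

I already have a rough picture of the ecosystem from the book, so I skip that here.

I am writing this to learn the concepts I should have down before writing code.

Since the concepts are what I care about, I will focus on the Conceptual Guides.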

hassaku63

A quick look through the Hugging Face transformers documentation

https://huggingface.co/docs/transformers/philosophy

Philosophy

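Who is transformers for?

Researchers and educators who use transformer models
People who want to fine-tune a model and use it
People who want to download a pretrained model and solve a specific task

Hugging Face transformers is designed around two strong goals.

(1) Be as easy and fast to use as possible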

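The user-facing abstractions are strongly limited. Most models need only three kinds of classes: configuration, models, and a preprocessing class (tokenizer / image processor / feature extractor / processor).

↑ These classes are instantiated from pretrained instances with the from_pretrained() method, which downloads them if needed. The data associated with the model (hyperparameters, the tokenizer vocabulary, the model weights, and so on) is loaded along with it and cached locally. This data is what is stored as a checkpoint on the Hugging Face Hub.

↑ On top of those classes, the library provides two APIs: pipeline() and Trainer. pipeline() lets you quickly run inference for a specific task, and Trainer lets you quickly fine-tune a PyTorch model. (TensorFlow models are apparently compatible with Keras.fit.)

transformers is not meant to be a set of building blocks for constructing neural networks. If that is what you want, use PyTorch, TensorFlow, or Keras.

For more detail at the coding level, read Repeat yourself.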

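(2) Provide state-of-the-art models whose performance is as close as possible to the original models

At least one example for reproducing the original results is provided. The sample code is kept as close as possible to the original, but because some of it was converted from a PyTorch implementation to TensorFlow (or the other way around), it may not be very Pythonic.

Besides the above, there are several other goals. (Omitted here.)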

Main concepts

As mentioned above, there are three kinds of classes.

(1) Model classes

A model provided by the library with pretrained weights. It is one of the following three kinds.

PyTorch models (torch.nn.Module)
Keras models (tf.keras.Model)
JAX/Flax models (flax.linen.Module)

(2) Configuration classes

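Stores the hyperparameters needed to build a model, such as the number of layers and the hidden size.

However, users of transformers do not always need to instantiate a configuration themselves. When you use a pretrained model without changing any settings, creating the model automatically instantiates the configuration (which is part of the model).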

(3) Preprocessing classes

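Classes that convert raw data into a format the model can accept.

For example, a tokenizer stores the vocabulary and provides encoding/decoding of strings, an image processor preprocesses vision inputs, and a feature extractor preprocesses audio inputs. A processor handles multimodal (combined) inputs.

All of these classes can be instantiated from pretrained instances. You can pull them from the Hub, and you can also go the other way and upload them. Three methods cover this: from_pretrained() / save_pretrained() / push_to_hub().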
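To make the tokenizer role described above (storing a vocabulary, encoding and decoding strings) a little more concrete, here is a minimal sketch. `ToyTokenizer` and its five-word vocabulary are hypothetical and exist only for illustration; real transformers tokenizers use subword algorithms and are far more involved.

```python
# A toy word-level tokenizer. This is NOT how transformers implements
# tokenizers; it only illustrates "vocabulary + encode/decode".
class ToyTokenizer:
    def __init__(self, vocab):
        # vocab maps a token string to an integer ID; "[UNK]" covers unknown words.
        self.token_to_id = dict(vocab)
        self.id_to_token = {idx: tok for tok, idx in self.token_to_id.items()}

    def encode(self, text):
        # Lowercase, split on whitespace, and look each word up in the vocabulary.
        unk = self.token_to_id["[UNK]"]
        return [self.token_to_id.get(word, unk) for word in text.lower().split()]

    def decode(self, ids):
        # Map IDs back to token strings and join them with spaces.
        return " ".join(self.id_to_token[idx] for idx in ids)


tokenizer = ToyTokenizer({"[UNK]": 0, "hugging": 1, "face": 2, "is": 3, "great": 4})
ids = tokenizer.encode("Hugging Face is great")
print(ids)                    # [1, 2, 3, 4]
print(tokenizer.decode(ids))  # hugging face is great
```

Note that decoding does not restore the original casing, which is one reason real tokenizers keep more information than this sketch does.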

hassaku63

Glossary

It also has sample code, so it looks like a good place to check when something in a usage example is unclear.

I will not cover specific entries here.

hassaku63

What 🤗 Transformers can do

https://huggingface.co/docs/transformers/task_summary

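Transformers is a library for working with pretrained transformer-based models, but it also covers models that are not transformers (for example, CNNs).

Note: from the context, the text appeared to say that the panoptic segmentation task uses a non-transformer model. This section introduces a task of removing the "background" from a photo (apparently a kind of task called panoptic segmentation) and seems to claim that transformers supports such tasks as well.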

Audio

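Because the input to audio processing tasks is an analog (continuous) signal, it cannot be split into clear-cut chunks the way text in NLP can. This sets audio apart from the other modalities.

Audio used to be preprocessed to extract some kind of features from it. Now the mainstream approach is to take in the raw signal directly and feed it to a feature encoder. This simplifies the preprocessing step and lets the model learn the most essential features.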

Audio classification

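This appears to be a broad category that includes many specific applications. For example:

Acoustic scene classification ... label audio with a scene label such as "office", "beach", or "stadium"
Acoustic event detection ... label a specific sound with a label such as "car horn"
Tagging ... label audio containing multiple sounds (birdsongs, speakers in a meeting)
Music classification ... label music with a genre label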

>>> from transformers import pipeline

>>> classifier = pipeline(task="audio-classification", model="superb/hubert-base-superb-er")
>>> preds = classifier("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
>>> preds
[{'score': 0.4532, 'label': 'hap'},
 {'score': 0.3622, 'label': 'sad'},
 {'score': 0.0943, 'label': 'neu'},
 {'score': 0.0903, 'label': 'ang'}]

Automatic speech recognition (ASR)

Converts speech to text. One of the most common tasks.

The contribution of transformers here, apparently, is that low-resource languages can now be handled.

>>> from transformers import pipeline

>>> transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-small")
>>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}

Computer vision

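Computer vision tasks are apparently solved in one of the following ways.

Use convolutions to learn hierarchical (?) features of an image, from low-level features up to high-level ones
Split the image into smaller patches and use a transformer to gradually learn how the patches relate to each other to form the image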
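The second approach starts by cutting the image into patches. The following is a minimal sketch of that step alone, assuming a square patch size that divides the image evenly; `split_into_patches` is a hypothetical helper, the "image" is a plain 4x4 grid of integers, and the transformer that processes the patches is not shown.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into flattened square patches, in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single list.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 "image" whose pixel values are 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(image, patch_size=2)
print(len(patches))  # 4
print(patches[0])    # [0, 1, 4, 5]
```

Each flattened patch then plays a role similar to a token in a sentence: the model receives a sequence of patches rather than a grid of pixels.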

Image classification

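A classification task. It has many applications.

Healthcare ... label medical images (to detect disease or monitor health)
Environment ... label satellite images to detect deforestation, wildfires, and so on, and to inform those who manage the land
Agriculture ... label images to monitor crop health
Ecology ... monitor wildlife populations or track endangered species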

>>> from transformers import pipeline

>>> classifier = pipeline(task="image-classification")
>>> preds = classifier(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
)
>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
>>> print(*preds, sep="\n")
{'score': 0.4335, 'label': 'lynx, catamount'}
{'score': 0.0348, 'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor'}
{'score': 0.0324, 'label': 'snow leopard, ounce, Panthera uncia'}
{'score': 0.0239, 'label': 'Egyptian cat'}
{'score': 0.0229, 'label': 'tiger cat'}

Object detection

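Unlike classification, object detection can identify multiple objects in an image. Each detection is defined by a bounding box.

Autonomous driving ... detect traffic objects such as other vehicles, pedestrians, and traffic lights
Remote sensing ... disaster monitoring, urban planning, weather forecasting
Defect detection ... detect structural damage such as cracks in buildings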

>>> from transformers import pipeline

>>> detector = pipeline(task="object-detection")
>>> preds = detector(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
)
>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"], "box": pred["box"]} for pred in preds]
>>> preds
[{'score': 0.9865,
  'label': 'cat',
  'box': {'xmin': 178, 'ymin': 154, 'xmax': 882, 'ymax': 598}}]
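The `box` dictionary above uses pixel coordinates of two corners. As a small sketch of how such boxes can be handled downstream, the helpers below compute the area of a box and the intersection-over-union (IoU) of two boxes. `box_area`, `iou`, and `other_box` are hypothetical names and values introduced here; they are not part of the transformers API.

```python
def box_area(box):
    # Width and height are clamped at 0 so that "negative" boxes count as empty.
    width = max(0, box["xmax"] - box["xmin"])
    height = max(0, box["ymax"] - box["ymin"])
    return width * height


def iou(box_a, box_b):
    # The intersection is itself a box; it is empty when the boxes do not overlap.
    intersection = box_area({
        "xmin": max(box_a["xmin"], box_b["xmin"]),
        "ymin": max(box_a["ymin"], box_b["ymin"]),
        "xmax": min(box_a["xmax"], box_b["xmax"]),
        "ymax": min(box_a["ymax"], box_b["ymax"]),
    })
    union = box_area(box_a) + box_area(box_b) - intersection
    return intersection / union if union else 0.0


cat_box = {"xmin": 178, "ymin": 154, "xmax": 882, "ymax": 598}  # from the output above
other_box = {"xmin": 0, "ymin": 0, "xmax": 100, "ymax": 100}    # made-up box

print(box_area(cat_box))        # 312576
print(iou(cat_box, cat_box))    # 1.0
print(iou(cat_box, other_box))  # 0.0
```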

Image segmentation

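A pixel-level task that assigns a "class" to every pixel. It is more granular than object detection, which uses bounding boxes.

Instance segmentation ... besides labeling the class of an object, it assigns an ID to each distinct object (instance); for example, human-1 and human-2 for an image showing two people
Panoptic segmentation ... a combination of "semantic segmentation" and "instance segmentation"; it assigns every pixel the class and the ID of the object it belongs to

Segmentation is useful, for example, for self-driving cars to perceive the world around them, and for processing medical images.

It is also used in e-commerce, for example in virtual try-on services.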
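The difference between these segmentation types can be sketched on a tiny hand-written "image". Everything below (the 2x3 grids, `to_panoptic`, `count_instances`) is made up for illustration and says nothing about how real models produce their masks.

```python
# Semantic segmentation: one class per pixel.
semantic = [
    ["sky", "sky", "sky"],
    ["person", "sky", "person"],
]
# Instance IDs: which object each pixel belongs to (0 = no distinct instance).
instance = [
    [0, 0, 0],
    [1, 0, 2],
]


def to_panoptic(semantic, instance):
    # Panoptic segmentation pairs every pixel's class with its instance ID.
    return [
        [(cls, obj_id) for cls, obj_id in zip(sem_row, inst_row)]
        for sem_row, inst_row in zip(semantic, instance)
    ]


def count_instances(panoptic):
    # Number of distinct instance IDs seen for each class.
    ids_per_class = {}
    for row in panoptic:
        for cls, obj_id in row:
            ids_per_class.setdefault(cls, set()).add(obj_id)
    return {cls: len(ids) for cls, ids in ids_per_class.items()}


panoptic = to_panoptic(semantic, instance)
print(panoptic[1])                # [('person', 1), ('sky', 0), ('person', 2)]
print(count_instances(panoptic))  # {'sky': 1, 'person': 2}
```

The semantic grid alone cannot tell the two people apart, while the panoptic grid can.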

>>> from transformers import pipeline

>>> segmenter = pipeline(task="image-segmentation")
>>> preds = segmenter(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
)
>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
>>> print(*preds, sep="\n")
{'score': 0.9879, 'label': 'LABEL_184'}
{'score': 0.9973, 'label': 'snow'}
{'score': 0.9972, 'label': 'cat'}

Depth estimation

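Estimates how far each pixel is from the camera. This task is especially important for scene understanding and recognition.

For example, a self-driving car needs to perceive distance to avoid colliding with objects.

Depth is also very important information when reconstructing a 3D world from 2D images.

There are two approaches to estimation.

stereo ... estimate depth by comparing two images of the same scene taken from slightly different angles
monocular ... estimate depth from a single image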

>>> from transformers import pipeline

>>> depth_estimator = pipeline(task="depth-estimation")
>>> preds = depth_estimator(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
)
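The pipeline above is the monocular case. For the stereo approach, the textbook relation for an idealized, rectified camera pair is depth = focal length × baseline / disparity, where disparity is how far a point shifts between the two images. The sketch below only evaluates that formula; `depth_from_disparity` and the camera numbers are assumptions made for illustration and are unrelated to the transformers API.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Depth in meters for an idealized, rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px


# Made-up camera: 800 px focal length, cameras 0.5 m apart.
print(depth_from_disparity(800.0, 0.5, 40.0))  # 10.0
print(depth_from_disparity(800.0, 0.5, 80.0))  # 5.0
```

A larger shift between the two views means the point is closer to the cameras.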

Natural language processing (NLP)

One of the most common types of tasks. Text has to be tokenized before a model can work with it.

Text classification

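Attaches a label to a single piece of text (a document, paragraph, or sentence). Typical applications are:

Sentiment analysis
Content classification ... for example, sorting news feed items by genre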

>>> from transformers import pipeline

>>> classifier = pipeline(task="sentiment-analysis")
>>> preds = classifier("Hugging Face is the best thing since sliced bread!")
>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
>>> preds
[{'score': 0.9991, 'label': 'POSITIVE'}]
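The `score` above is a probability. As a rough sketch of where such a number can come from, assume the model produces one raw score (logit) per label and that a softmax turns them into probabilities. The logits below are made up, and whether a given pipeline applies softmax depends on the model's configuration.

```python
import math


def softmax(logits):
    # Subtract the maximum first for numerical stability.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


labels = ["NEGATIVE", "POSITIVE"]
logits = [-3.2, 3.8]  # made-up raw model outputs, one per label
probs = softmax(logits)

# Pick the label with the highest probability.
best = max(range(len(labels)), key=probs.__getitem__)
result = {"label": labels[best], "score": round(probs[best], 4)}
print(result)  # {'label': 'POSITIVE', 'score': 0.9991}
```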

Token classification

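Assigns a predefined label to each token. There are two common types:

Named entity recognition (NER) ... assigns a category such as organization, person, location, or date. It is popular in, for example, the medical field, where it can label genes, proteins, and drug names
Part-of-speech tagging (POS) ... classifies tokens as noun, verb, adjective, and so on. It helps translation systems recognize how the same word differs depending on the context it is used in; for example, understanding a word like "bank", which has one spelling but can act as either a noun or a verb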

>>> from transformers import pipeline

>>> classifier = pipeline(task="ner")
>>> preds = classifier("Hugging Face is a French company based in New York City.")
>>> preds = [
    {
        "entity": pred["entity"],
        "score": round(pred["score"], 4),
        "index": pred["index"],
        "word": pred["word"],
        "start": pred["start"],
        "end": pred["end"],
    }
    for pred in preds
]
>>> print(*preds, sep="\n")
{'entity': 'I-ORG', 'score': 0.9968, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9293, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9763, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-MISC', 'score': 0.9983, 'index': 6, 'word': 'French', 'start': 18, 'end': 24}
{'entity': 'I-LOC', 'score': 0.999, 'index': 10, 'word': 'New', 'start': 42, 'end': 45}
{'entity': 'I-LOC', 'score': 0.9987, 'index': 11, 'word': 'York', 'start': 46, 'end': 50}
{'entity': 'I-LOC', 'score': 0.9992, 'index': 12, 'word': 'City', 'start': 51, 'end': 55}
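The output above is per token, so "Hugging Face" arrives as three pieces ('Hu', '##gging', 'Face'). A minimal sketch of merging them back into whole entities using the `start`/`end` character offsets follows. `group_entities` is a hypothetical helper and deliberately ignores the B-/I- prefix distinction.

```python
text = "Hugging Face is a French company based in New York City."

# Only the fields needed for grouping, copied from the pipeline output above.
tokens = [
    {"entity": "I-ORG", "start": 0, "end": 2},
    {"entity": "I-ORG", "start": 2, "end": 7},
    {"entity": "I-ORG", "start": 8, "end": 12},
    {"entity": "I-MISC", "start": 18, "end": 24},
    {"entity": "I-LOC", "start": 42, "end": 45},
    {"entity": "I-LOC", "start": 46, "end": 50},
    {"entity": "I-LOC", "start": 51, "end": 55},
]


def group_entities(text, tokens):
    groups = []
    for token in tokens:
        label = token["entity"].split("-", 1)[-1]  # "I-ORG" -> "ORG"
        last = groups[-1] if groups else None
        # Extend the previous group if the label matches and the token is
        # adjacent to it (at most one character, e.g. a space, in between).
        if last and last["label"] == label and token["start"] - last["end"] <= 1:
            last["end"] = token["end"]
        else:
            groups.append({"label": label, "start": token["start"], "end": token["end"]})
    return [(group["label"], text[group["start"]:group["end"]]) for group in groups]


entities = group_entities(text, tokens)
print(entities)
# [('ORG', 'Hugging Face'), ('MISC', 'French'), ('LOC', 'New York City')]
```

In practice the token-classification pipeline accepts an `aggregation_strategy` argument (for example `"simple"`) that performs this kind of grouping, so a hand-written helper like this is mainly useful for understanding the offsets.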

Question answering

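Another token-level task. It comes both with and without context (open-domain or closed-domain).

There are two common types of question answering (I could not quite grasp the nuance of the original text).

Extractive ... xxx
Abstractive ... apparently Text2TextGenerationPipeline can be used for this. It seems to be different from the QuestionAnsweringPipeline shown in the sample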

>>> from transformers import pipeline

>>> question_answerer = pipeline(task="question-answering")
>>> preds = question_answerer(
    question="What is the name of the repository?",
    context="The name of the repository is huggingface/transformers",
)
>>> print(
    f"score: {round(preds['score'], 4)}, start: {preds['start']}, end: {preds['end']}, answer: {preds['answer']}"
)
score: 0.9327, start: 30, end: 54, answer: huggingface/transformers
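In the output above, `start` and `end` are character offsets into the context, so the answer is literally a slice of it. The sketch below checks that, and also shows one common way a span can be picked: score every token as a possible start and as a possible end, then take the best-scoring pair. `best_span` and all the scores are made up for illustration; the actual pipeline's decoding has more details.

```python
context = "The name of the repository is huggingface/transformers"
print(context[30:54])  # huggingface/transformers


def best_span(start_scores, end_scores, max_len=5):
    """Return (start, end) token indices maximizing start score + end score."""
    best_score = float("-inf")
    best = (0, 0)
    for start, start_score in enumerate(start_scores):
        # Only consider spans that end at or after the start, up to max_len tokens.
        last = min(start + max_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best_score, best = score, (start, end)
    return best


tokens = context.split()
start_scores = [0.1, 0.2, 0.0, 0.1, 0.5, 0.3, 4.0]  # made-up scores
end_scores = [0.0, 0.1, 0.0, 0.0, 0.6, 0.2, 4.5]    # made-up scores

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)  # huggingface/transformers
```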

Summarization

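Generates a shorter version of a text while preserving its original meaning. It is a seq-to-seq task. Bills, legal documents, financial documents, patents, and scientific papers are examples of documents that can be summarized to save readers time and serve as a reading aid.

Like question answering, there are two types.

Extractive ... extract the most important sentences
Abstractive ... generate a summary, which may contain words that do not appear in the original text. SummarizationPipeline uses this approach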

>>> from transformers import pipeline

>>> summarizer = pipeline(task="summarization")
>>> summarizer(
    "In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles."
)
[{'summary_text': ' The Transformer is the first sequence transduction model based entirely on attention . It replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention . For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers .'}]

Translation

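Like summarization, this is a seq-to-seq task: both the input and the output are sequences.

Early translation models were monolingual, but interest has recently grown in multilingual models that support translation between many languages.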

>>> from transformers import pipeline

>>> text = "translate English to French: Hugging Face is a community-based open-source platform for machine learning."
>>> translator = pipeline(task="translation", model="t5-small")
>>> translator(text)
[{'translation_text': "Hugging Face est une tribune communautaire de l'apprentissage des machines."}]

Language modeling

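A task that predicts a word in a sequence of text. Because a pretrained model can be applied to many downstream tasks (a term that presumably has fine-tuning or transfer learning for a specific problem in mind), it is an especially popular NLP task.

Recently there has been a lot of interest in zero-shot / few-shot learning (which seems to be a different context from prompt hacking). This means a model can solve tasks it was not explicitly trained on.

Such models can generate very fluent text, but note that the content is not always correct.

There are two types.

(1) Causal

The model's objective is to predict the next token or sequence, and future (that is, later) tokens are masked.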

>>> from transformers import pipeline

>>> prompt = "Hugging Face is a community-based open-source platform for machine learning."
>>> generator = pipeline(task="text-generation")
>>> generator(prompt)  # doctest: +SKIP

Note: given arbitrary text, it generates the words or sentences that follow.
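"Future tokens are masked" can be pictured as a lower-triangular visibility table: position i may look at positions 0..i and nothing after. The sketch below builds such a table in plain Python; `causal_mask` is a hypothetical helper, and real implementations work on tensors inside the attention layers.

```python
def causal_mask(seq_len):
    # mask[row][col] is True when the token at position `row`
    # is allowed to see the token at position `col`.
    return [[col <= row for col in range(seq_len)] for row in range(seq_len)]


for row in causal_mask(4):
    print("".join("x" if visible else "." for visible in row))
# x...
# xx..
# xxx.
# xxxx
```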

(2) Masked

The model's objective is to predict the masked token in a sequence while having access to all the other tokens in the sequence.

>>> from transformers import pipeline

>>> text = "Hugging Face is a community-based open-source <mask> for machine learning."
>>> fill_mask = pipeline(task="fill-mask")
>>> preds = fill_mask(text, top_k=1)
>>> preds = [
    {
        "score": round(pred["score"], 4),
        "token": pred["token"],
        "token_str": pred["token_str"],
        "sequence": pred["sequence"],
    }
    for pred in preds
]
>>> preds
[{'score': 0.2236,
  'token': 1761,
  'token_str': ' platform',
  'sequence': 'Hugging Face is a community-based open-source platform for machine learning.'}]

Note: this has the model solve a so-called fill-in-the-blank problem.
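How such a fill-in-the-blank example can be built from ordinary text is sketched below: hide one token and keep the original as the target to predict. `mask_at` is a hypothetical helper that splits on whitespace for simplicity; real masked language modeling operates on tokenizer output and usually masks positions at random.

```python
def mask_at(tokens, position, mask_symbol="<mask>"):
    """Return a copy of tokens with one position hidden, plus the hidden token."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask_symbol
    return masked, target


sentence = "Hugging Face is a community-based open-source platform for machine learning."
masked, target = mask_at(sentence.split(), 6)

print(" ".join(masked))
# Hugging Face is a community-based open-source <mask> for machine learning.
print(target)  # platform
```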

Multimodal

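Multimodal tasks require a model to process multiple data modalities (text, image, audio, video, and so on) to solve a particular problem.

Image captioning is one example: a multimodal task where the model takes an image as input and outputs text.

The combinations of data types vary, but the preprocessing step basically converts all the data into embeddings. For image captioning, for example, the model learns the relationship between image embeddings and text embeddings.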
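Once both images and text are represented as embeddings (vectors), "relationship" can be something as simple as how closely two vectors point in the same direction. The sketch below compares a made-up image embedding with two made-up caption embeddings using cosine similarity. All vectors and the `cosine_similarity` helper are assumptions for illustration; real embeddings are learned by the model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional embeddings.
image_embedding = [0.9, 0.1, 0.3]
caption_embeddings = {
    "a cat on the snow": [0.8, 0.2, 0.4],
    "a stock chart": [0.1, 0.9, 0.0],
}

# Pick the caption whose embedding is most similar to the image embedding.
best_caption = max(
    caption_embeddings,
    key=lambda caption: cosine_similarity(image_embedding, caption_embeddings[caption]),
)
print(best_caption)  # a cat on the snow
```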

Document question answering

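A task that answers questions from a "document". For example, detecting the amount or the addressee on an invoice falls into this area.

Note: LayoutLMv3, which LayerX has been evaluating for OCR on business forms, should fall into this category.

The snippet below runs Document question answering on a sample document image.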

>>> from transformers import pipeline
>>> from PIL import Image
>>> import requests

>>> url = "https://datasets-server.huggingface.co/assets/hf-internal-testing/example-documents/--/hf-internal-testing--example-documents/test/2/image/image.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> doc_question_answerer = pipeline("document-question-answering", model="magorshunov/layoutlm-invoices")
>>> preds = doc_question_answerer(
    question="What is the total amount?",
    image=image,
)
>>> preds
[{'score': 0.8531, 'answer': '17,000', 'start': 4, 'end': 4}]

hassaku63

How 🤗 Transformers solve tasks

https://huggingface.co/docs/transformers/tasks_explained

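"What 🤗 Transformers can do" gave a rough idea of what is possible, so this page goes into topics closer to the implementation and internals.

Note: apparently this is where non-transformer topics such as CNNs come up.

To see how each problem is solved, the page also looks inside specific models.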

Wav2Vec2 for audio classification and automatic speech recognition (ASR)
Vision Transformer (ViT) and ConvNeXT for image classification
DETR for object detection
Mask2Former for image segmentation
GLPN for depth estimation
BERT for NLP tasks like text classification, token classification and question answering that use an encoder
GPT2 for NLP tasks like text generation that use a decoder
BART for NLP tasks like summarization and translation that use an encoder-decoder

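Note: unrelated to the main topic, but I wish they had not given models names that sound as similar as BERT and BART (just my impression).

This part has diagrams and I cannot follow all of it in one go, so I am skipping it.

The Encoder / Decoder discussion comes up throughout, so it seems worth studying that separately.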