Hugging Face NLP Course - 2. USING 🤗 TRANSFORMERS
Overview
A summary of the key points of this chapter.
Behind the pipeline
What the pipeline function does under the hood.
Preprocessing with a tokenizer
What the tokenizer does
All of this preprocessing must be done in exactly the same way as when the model was pretrained. The tokenizer is responsible for:
- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model
Loading the tokenizer:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Tokenizing the inputs
Passing return_tensors="pt" makes the tokenizer return PyTorch tensors:
raw_inputs = [
"I've been waiting for a HuggingFace course my whole life.",
"I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
{
'input_ids': tensor([
[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
[ 101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0]
]),
'attention_mask': tensor([
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
])
}
Going through the model
Creating (loading) the model:
from transformers import AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
Passing the inputs to the model returns what are called hidden states, also known as features.
The hidden states are sometimes useful on their own, but they are usually fed into another part of the model called the head.
The head differs depending on the task.
A high-dimensional vector?
The model's output generally has three dimensions:
- Batch size: the number of sequences passed in
- Sequence length: the number of tokens in each sequence
- Hidden size: the dimensionality of the vector for each token
Checking the shape:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
torch.Size([2, 16, 768])
outputs behaves like a namedtuple, so its elements can also be accessed by index, e.g. outputs[0].
The following is therefore equivalent to outputs.last_hidden_state.shape:
print(outputs[0].shape)
torch.Size([2, 16, 768])
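The elements can also be accessed by key, dict-style, which is equivalent to the attribute and index access above:
print(outputs["last_hidden_state"].shape)
torch.Size([2, 16, 768])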
Model heads: Making sense out of numbers
The head converts the high-dimensional vector of hidden states into the dimensions required by the task.
It is usually composed of one or a few linear layers.
Various architectures exist for different tasks:
*Model (retrieve the hidden states)
*ForCausalLM
*ForMaskedLM
*ForMultipleChoice
*ForQuestionAnswering
*ForSequenceClassification
*ForTokenClassification
and others 🤗
An example using a model with a sequence classification head:
from transformers import AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)
torch.Size([2, 2])
Postprocessing the output
The logits returned by the model are raw, unnormalized scores. Converting them into probabilities requires a softmax:
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
tensor([[4.0195e-02, 9.5980e-01],
[9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)
First sentence: [0.0402, 0.9598]
Second sentence: [0.9995, 0.0005]
The mapping between each index and its label:
model.config.id2label
{0: 'NEGATIVE', 1: 'POSITIVE'}
First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005
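For reference, a small sketch (not from the course) that reproduces the label/probability pairs above by combining predictions with model.config.id2label:
# Map each probability to its label name using the model's config
for i, probs in enumerate(predictions):
    labeled = {model.config.id2label[j]: round(p.item(), 4) for j, p in enumerate(probs)}
    print(f"Sentence {i + 1}: {labeled}")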
Models
Creating a Transformer
An example of instantiating a BERT model.
The weights are initialized completely at random:
from transformers import BertConfig, BertModel
# Building the config
config = BertConfig()
# Building the model from the config
model = BertModel(config)
print(config)
BertConfig {
[...]
"hidden_size": 768,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
[...]
}
Different loading methods
Loading a model with pretrained weights.
In practice, it is preferable to use the equivalent AutoModel class, because the same code keeps working even if you switch to a checkpoint with a different architecture.
The BertConfig is now the one set by the authors of the checkpoint; the details are documented in the model card.
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-cased")
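As noted above, the architecture-agnostic AutoModel version of the same load looks like this:
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-cased")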
Where the downloaded weights are cached (the location can be changed with the HF_HOME environment variable):
~/.cache/huggingface/transformers
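One way to redirect the cache, shown as a sketch (HF_HOME is typically read when the libraries are imported, so it has to be set beforehand; the path is a placeholder):
import os
os.environ["HF_HOME"] = "/path/to/my/cache"  # placeholder path; set before importing transformers
from transformers import BertModel  # downloads are now cached under the custom HF_HOME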
The list of checkpoints compatible with BertModel is available on the Hugging Face Hub.
Saving methods
Saving a model:
model.save_pretrained("directory_on_my_computer")
Two files are saved:
!ls directory_on_my_computer
config.json pytorch_model.bin
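The saved directory can be passed straight back to from_pretrained to reload the model (a quick round-trip check, not part of the course snippet):
from transformers import BertModel
model = BertModel.from_pretrained("directory_on_my_computer")  # reads config.json and pytorch_model.bin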
The contents of config.json:
!cat directory_on_my_computer/config.json
{
"_name_or_path": "bert-base-cased",
"architectures": [
"BertModel"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"torch_dtype": "float32",
"transformers_version": "4.34.0",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 28996
}
Tokenizers
Word-based
The pattern of splitting text into words.
Think of splitting on whitespace; variants also apply extra rules for punctuation.
The vocabulary is the total number of distinct tokens.
Each word is assigned an ID, starting at 0 and going up to the size of the vocabulary; the model uses these IDs to identify each word.
Covering every word leads to an enormous vocabulary: English alone has over 500,000 words.
Inflected forms such as singular and plural are also treated as separate tokens.
Tokens that are not in the vocabulary are handled as an "unknown" token, typically represented as "[UNK]" or "<unk>".
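A minimal whitespace-splitting example, as in the course:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)
['Jim', 'Henson', 'was', 'a', 'puppeteer']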
Character-based
The pattern of splitting text into individual characters.
Advantages: the vocabulary is much smaller, and there are far fewer "unknown" tokens.
On the other hand, a single character carries little meaning on its own (with exceptions such as Chinese), and the number of tokens per text becomes very large.
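For illustration (a sketch, not from the course), splitting even a single word into characters already yields nine tokens:
print(list("puppeteer"))
['p', 'u', 'p', 'p', 'e', 't', 'e', 'e', 'r']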
Subword tokenization
A best-of-both-worlds approach that combines the two methods above.
Words are split further into subwords.
It relies on the principle that frequently used words should not be split into smaller subwords, while rare words should be decomposed into meaningful subwords.
This keeps the vocabulary small while preserving meaning.
And more!
There are other techniques as well:
- Byte-level BPE, as used in GPT-2
- WordPiece, as used in BERT
- SentencePiece or Unigram, as used in several multilingual models
Loading and saving
Loading a tokenizer directly by its class:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
Loading with AutoTokenizer (recommended):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer("Using a Transformer network is simple")
{'input_ids': [101, 7993, 170, 11303, 1200, 2443, 1110, 3014, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
Saving the tokenizer:
tokenizer.save_pretrained("directory_on_my_computer")
Encoding
Translating text to numbers is called encoding.
Encoding is a two-step process: tokenization, followed by conversion to input IDs.
- Tokenization: splitting the text into tokens. Several splitting rules exist, so the tokenizer must be instantiated from the model's name to ensure the same rules are used as when the model was pretrained.
- Conversion to input IDs: mapping the tokens to numbers. Again, the same mapping must be used as during pretraining.
Below, the two steps are run separately.
(This is only for explanation; in practice you can do both in one call, as shown earlier.)
Tokenization
An example of subword tokenization:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
print(tokens)
['Using', 'a', 'Transform', '##er', 'network', 'is', 'simple']
From tokens to input IDs
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
[7993, 170, 11303, 1200, 2443, 1110, 3014]
Decoding
Decoding converts token IDs back into a string.
Note that tokens that were part of the same word are merged back together.
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)
'Using a Transformer network is simple'
Handling multiple sequences
Models expect a batch of inputs
The input must match the dimensions the model expects; in particular, models expect a batch dimension.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence = "I've been waiting for a HuggingFace course my whole life."
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)
output = model(input_ids)
print("Logits:", output.logits)
Input IDs: tensor([[ 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]])
Logits: tensor([[-2.7276, 2.8789]], grad_fn=<AddmmBackward>)
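For reference, calling the tokenizer directly (as the course points out) adds both the batch dimension and the special tokens, so the manual steps above are only for illustration:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])
tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,  2026,  2878,  2166,  1012,   102]])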
Padding the inputs
When feeding several sentences as one batch, shorter sequences must be padded with the padding token so that every row has the same number of tokens.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
[200, 200, 200],
[200, 200, tokenizer.pad_token_id],
]
print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)
tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
[ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)
Note that [200, 200] and [200, 200, tokenizer.pad_token_id] produce different logits.
This is because the attention layers also attend to the padding token.
It can be fixed by passing an attention mask.
Attention masks
Passing an attention mask tells the attention layers to ignore the padding tokens.
The result for the second sequence now matches what you get when passing [200, 200] on its own.
batched_ids = [
[200, 200, 200],
[200, 200, tokenizer.pad_token_id],
]
attention_mask = [
[1, 1, 1],
[1, 1, 0],
]
outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)
tensor([[ 1.5694, -1.3895],
[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
Longer sequences
There is a limit to the number of tokens a model can handle.
Most models support up to 512 or 1024 tokens; passing longer sequences makes the model crash.
Two ways around this:
- Use a model that supports longer sequence lengths, such as Longformer or LED.
- Truncate the sequences:
sequence = sequence[:max_sequence_length]
Putting it all together
A typical end-to-end example:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)
Tokenizing several sentences at once:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
model_inputs = tokenizer(sequences)
Padding options:
# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")
# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")
# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
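A quick check of the padding behaviour (a sketch; 16 is the length of the longer sequence including its special tokens):
model_inputs = tokenizer(sequences, padding="longest")
print([len(ids) for ids in model_inputs["input_ids"]])
[16, 16]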
Truncation options:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)
# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
Returning tensors for a specific framework:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")
# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
Special tokens
Some models add special tokens at the beginning and end of the input.
If they were added during pretraining, they must be added in exactly the same way at inference and fine-tuning time.
(Using the tokenizer that matches the model takes care of this automatically.)
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
Decoding both:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))
"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
"i've been waiting for a huggingface course my whole life."
Wrapping up: From tokenizer to model
A summary of everything covered so far:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
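To turn this final output into probabilities, apply the same softmax post-processing as in the pipeline section above (torch is already imported here):
predictions = torch.nn.functional.softmax(output.logits, dim=-1)
print(predictions)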