【Paper】BERT part2 ~2 phase model~


The original paper is here.

This article is part 2 of the BERT explanation.
Part 1 is here.


2.0 Embedding

BERT uses three types of embeddings, in both training and inference. Each has its own purpose, and together they help the model capture word (and sentence) features appropriately.

・An image of the whole embedding process

Quote: [3]

1. Token Embedding

Token embedding divides the input into tokens. For example, the sentence
"I am swimming!"
will be like
([CLS], I, am, swim, ##ing, !, [SEP])
and further becomes
([0.54, 0.22, 0.99...], [0.32, ...], [...], [...], [...], [...]).

This process is called tokenization: dividing a sentence into parts and transforming each part into a vector that the model can easily work with.

For efficiency, "swimming" is divided into "swim" and "##ing". This lets the model, for example, handle "walking" given only "walk" plus the shared "##ing" piece.
[CLS] and [SEP] stand for "classification" and "separator". [CLS] is added at the start of the sequence; [SEP] is added at the end of a sentence and at the boundary between two sentences.
A learned embedding layer (a lookup table trained together with the network) then transforms each token into a vector.
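The subword splitting above can be sketched as a greedy longest-match WordPiece tokenizer. The tiny vocabulary here is hypothetical (the real BERT vocabulary has about 30,000 entries), so this is only a minimal illustration of the idea:

```python
# Toy greedy longest-match WordPiece tokenizer.
# VOCAB is a hypothetical miniature vocabulary, not BERT's real one.
VOCAB = {"[CLS]", "[SEP]", "[UNK]", "i", "am", "swim", "walk", "##ing", "!"}

def wordpiece(word):
    """Split one lower-cased word into vocabulary subwords, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            # Non-initial subwords carry the "##" continuation prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                match = piece
                break
            end -= 1
        if match is None:          # no subword fits -> unknown token
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces

def tokenize(sentence):
    """Lower-case, split on whitespace, subword-split, and add [CLS]/[SEP]."""
    tokens = ["[CLS]"]
    for word in sentence.lower().replace("!", " !").split():
        tokens.extend(wordpiece(word))
    tokens.append("[SEP]")
    return tokens

print(tokenize("I am walking!"))
# -> ['[CLS]', 'i', 'am', 'walk', '##ing', '!', '[SEP]']
```

Because "##ing" is one shared vocabulary entry, every "-ing" word reuses it instead of needing its own token.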

2. Positional Embedding

This embedding provides information about the position of the tokens in the sequence.
The simplest way to do this would be to add the token's index to the vector itself: a larger added number would then indicate a position further back in the sequence.

But this approach scales poorly, because the model would have to cope with arbitrarily large numbers. Instead, the original Transformer used sinusoidal (sine/cosine) positional encodings, which embed word-order information efficiently with bounded values. (BERT itself actually learns its position embeddings during pre-training.)
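As a sketch of the sine/cosine approach from the original Transformer paper (again, BERT itself learns its position embeddings rather than using this fixed formula):

```python
import math

def positional_encoding(seq_len, d_model):
    """Fixed sinusoidal positional encodings (original Transformer style).

    Even dimensions use sine, odd dimensions use cosine, with wavelengths
    that grow geometrically across dimensions. Every entry stays in
    [-1, 1], so positions never dominate the token vectors.
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=6, d_model=8)
# pe[pos] is simply added element-wise to the token embedding at that position.
```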

3. Segment(type) Embedding

This will make more sense after the NSP section below.
This embedding differentiates between the two sentences in the input during NSP, but it is also included in MLM pre-training to maintain consistency in the embedding process.
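A minimal sketch of how segment ids are assigned: 0 for sentence A (through its [SEP]) and 1 for sentence B. Each id then indexes a learned segment-embedding vector, which is added to the token and position embeddings:

```python
def segment_ids(tokens):
    """0 for sentence A (up to and including its [SEP]); 1 for everything after."""
    ids, seg = [], 0
    for tok in tokens:
        ids.append(seg)
        if tok == "[SEP]":
            seg = 1          # switch to sentence B after the first separator
    return ids

tokens = ["[CLS]", "my", "dog", "is", "cute", "[SEP]",
          "he", "likes", "play", "##ing", "[SEP]"]
print(segment_ids(tokens))
# -> [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```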

2.1 Pre-training

The image below gives an overview of pre-training and fine-tuning.
This section explains pre-training.

Quote: [1]

BERT uses two pre-training tasks, neither of which needs labeled data.

1. Masked LM (MLM)

Before the input reaches the Transformer encoder, 15% of the tokens are selected and (mostly) replaced with a [MASK] token.
BERT then attempts to predict each masked word with a linear head.

Quote: [2]

Here, the Transformer encoder captures the features of the input sentences and the relations among their tokens as an internal representation (vectors). A fully connected layer (with a GELU activation and normalization) then uses this representation to predict each masked word as a probability distribution via softmax.
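The masking step can be sketched as follows. In the paper, 15% of token positions are chosen; of those, 80% become [MASK], 10% become a random token, and 10% stay unchanged. This toy helper (pure Python, hypothetical names) illustrates that recipe:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style MLM corruption (toy version).

    Roughly 15% of non-special positions are selected; of those, 80%
    become [MASK], 10% become a random vocabulary token, and 10% are
    left unchanged. `labels` records the original token at every
    selected position (None elsewhere) as the prediction target.
    """
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        labels[i] = tok                       # the head must recover this
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # random replacement
        # else: keep the original token (the model must still predict it)
    return corrupted, labels

tokens = ["[CLS]", "my", "dog", "is", "cute", "[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab=["dog", "cat", "runs"])
```

The 10% random / 10% unchanged cases keep the encoder from relying on the literal [MASK] symbol, which never appears at fine-tuning time.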

2. Next Sentence Prediction (NSP)

Before feeding the input into the Transformer encoder, two sentences are joined with a [SEP] token. In 50% of cases the second sentence really follows the first in the source text; in the other 50% it is swapped for a random sentence from the corpus.

The model has to predict whether the second sentence actually follows the first, so it computes the probability of IsNext with a softmax.

Quote: [1]
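The pair construction can be sketched like this (hypothetical helper; real pre-processing also truncates each pair to a maximum sequence length):

```python
import random

def make_nsp_example(doc, corpus, rng):
    """Build one NSP training pair from a document (a list of sentences):
    50% the true next sentence (label 1, IsNext),
    50% a random sentence from the corpus (label 0, NotNext)."""
    i = rng.randrange(len(doc) - 1)
    sent_a = doc[i]
    if rng.random() < 0.5:
        return sent_a, doc[i + 1], 1       # genuine consecutive pair
    return sent_a, rng.choice(corpus), 0   # random second sentence

rng = random.Random(0)
example = make_nsp_example(["s1", "s2", "s3"], ["x", "y", "z"], rng)
```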

3. Putting it together

Importantly, MLM and NSP are trained at the same time during pre-training, on a large amount of unlabeled text:
・BooksCorpus: about 800 million words.
・English Wikipedia: about 2,500 million words (2.5 billion).
In total, BERT was pre-trained on approximately 3.3 billion words.

This is the pre-training of BERT.

2.2 Fine-tuning

It is not difficult to fit BERT to a specific task; this is realized by adding a small task-specific layer on top.

1. Classification

Consider classification, which involves assigning a label to an entire input sequence, such as sentiment analysis (positive/negative) or topic classification.

Fine-tuning Process:

  1. Model Architecture
    Add a classification layer (usually a fully connected layer) on top of BERT.
  2. Input Representation
    Use the [CLS] token’s representation from the final hidden layer as the aggregate representation of the sequence. (Information aggregates into [CLS] because the loss is computed from it; this is the same flow as NSP.)
  3. Training Data
    Prepare a labeled dataset where each input sequence is associated with a specific class label.
  4. Loss Function
    Use a suitable loss function, such as cross-entropy loss, for training.
  5. Training
    Fine-tune the model on the labeled dataset, allowing the model to adjust its weights to minimize the classification loss.

・For sentiment analysis, the input sentence "I love this movie" would be classified as "positive."
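The steps above can be sketched as a minimal classification head, assuming we already have the [CLS] vector from BERT's final layer (in practice this would be a framework layer trained by backprop together with BERT; the class name here is hypothetical):

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

class ClassificationHead:
    """A single linear layer on top of the [CLS] vector (toy sketch)."""
    def __init__(self, hidden_size, num_labels, seed=0):
        rng = random.Random(seed)
        # Small random init, as is typical for a freshly added head.
        self.w = [[rng.gauss(0, 0.02) for _ in range(hidden_size)]
                  for _ in range(num_labels)]
        self.b = [0.0] * num_labels

    def __call__(self, cls_vector):
        """Map the [CLS] vector to class probabilities."""
        logits = [sum(wi * x for wi, x in zip(row, cls_vector)) + bi
                  for row, bi in zip(self.w, self.b)]
        return softmax(logits)

head = ClassificationHead(hidden_size=4, num_labels=2)
probs = head([0.1, -0.2, 0.3, 0.05])   # probabilities for e.g. negative/positive
```

During fine-tuning, cross-entropy loss on these probabilities updates both the head and (usually) all of BERT's weights.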

2. Question Answering

Consider question answering, which involves identifying the span of text in a passage that answers a given question.

Fine-tuning Process:

  1. Model Architecture
    Add two output layers on top of BERT: one for predicting the start position and one for predicting the end position of the answer.
  2. Input Representation
    Feed a pair of sequences into BERT: the question and the context passage, separated by a [SEP] token.
  3. Training Data
    Prepare a labeled dataset where each question-context pair has the start and end positions of the answer span within the context passage.
  4. Loss Function
    Use the sum of the start and end position losses (typically cross-entropy loss) for training.
  5. Training
    Fine-tune the model on the labeled dataset, adjusting its weights to predict the correct start and end positions of the answer span.

・For the question "What is the capital of France?" and the context "Paris is the capital of France," the model would predict the start and end positions corresponding to "Paris."
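Decoding the answer from the two output layers can be sketched as: score every candidate (start, end) pair with start ≤ end and pick the best, as in standard BERT-style extractive QA (the `max_len` cap is a common practical detail, not something this article specifies):

```python
def best_span(start_logits, end_logits, max_len=30):
    """Return the (start, end) index pair maximising
    start_logits[s] + end_logits[e], subject to s <= e < s + max_len."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy logits where token 1 is the best start and token 2 the best end:
print(best_span([0.0, 5.0, 0.0], [0.0, 0.0, 5.0]))
# -> (1, 2)
```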

3. Named Entity Recognition (NER)

Consider NER, which involves labeling each token in a sequence with an entity type (e.g., PERSON, LOCATION, ORGANIZATION).

Fine-tuning Process:

  1. Model Architecture
    Add a token classification layer on top of BERT.
  2. Input Representation
    Use the token representations from the final hidden layer.
  3. Training Data
    Prepare a labeled dataset where each token in the input sequence is tagged with an entity label.
  4. Loss Function
    Use a suitable token classification loss, such as cross-entropy loss for each token.
  5. Training
    Fine-tune the model on the labeled dataset, allowing the model to adjust its weights to predict the correct entity labels for each token.

Output modeling:
On top of BERT, a token classification layer is added. This layer assigns a label to each token based on its contextualized representation. The possible labels depend on the specific NER task, commonly using a BIO or IOB scheme:
B-PER: Beginning of a person's name
I-PER: Inside a person's name
B-LOC: Beginning of a location name
I-LOC: Inside a location name
B-ORG: Beginning of an organization name
I-ORG: Inside an organization name
O: Outside any named entity

Input modeling:
For example, given the sentence "Barack Obama was born in Hawaii," the training data would be:
["Barack", "Obama", "was", "born", "in", "Hawaii", "."]
["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]

For example, the sentence "Sundar Pichai is the CEO of Google." might be tagged as:
["Sundar", "Pichai", "is", "the", "CEO", "of", "Google", "."]
["B-PER", "I-PER", "O", "O", "O", "O", "B-ORG", "O"]
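After the token classification layer predicts one BIO tag per token, the tags are typically collapsed back into entity spans. A minimal decoder for tag sequences like the examples above:

```python
def bio_to_entities(tokens, tags):
    """Collapse parallel token/BIO-tag lists into (entity_text, type) spans."""
    entities, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                         # close the previous entity
                entities.append((" ".join(current), ctype))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(tok)                 # continue the current entity
        else:                                   # "O" (or an orphan I- tag)
            if current:
                entities.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        entities.append((" ".join(current), ctype))
    return entities

print(bio_to_entities(
    ["Barack", "Obama", "was", "born", "in", "Hawaii", "."],
    ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]))
# -> [('Barack Obama', 'PER'), ('Hawaii', 'LOC')]
```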


In this article, we explained BERT, which boasts top-tier performance in the field of natural language processing.
BERT recognizes context from both directions, and is able to recognize sentences at a high level by performing general-purpose pre-training followed by fine-tuning for specific tasks.

That's all for the explanation of BERT.
Thank you for reading.


By the way, GPT is also a pre-training + fine-tuning model; the difference between the two is their purpose.
BERT aims to capture sentence meaning and relations bidirectionally. In contrast, GPT aims to predict the next word from the preceding context (it uses information from one direction only).
Both are inspired by the same model (the Transformer) and follow the same two-phase procedure, but their concepts differ.


[1] Tom B. Brown et al., "Language Models are Few-Shot Learners", arXiv, 2020.
[2] Rani Horev, "BERT Explained: State of the art language model for NLP", Medium, 2018.
[3] "BERT Embeddings", Tinkerd, 2023.