【Paper】BERT part1 ~ BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding ~


This article is an explanation of BERT (Bidirectional Encoder Representations from Transformers).

1. Introduction

Pre-trained language models can currently be divided into two groups according to how they handle downstream tasks.
One is the feature-based approach, which builds a task-specific model that uses the pre-trained representations as additional features.
The other is the fine-tuning approach, which modifies the final part of the pre-trained model (for example, changing only the output size of the linear layer of a classification model to fit a different classification task) and then retrains the new model, using the pre-trained parameters as its initial weights.
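The difference between the two approaches comes down to which parameters are updated on the downstream task. The following is a minimal sketch, not the setup used in the paper: the one-weight "encoder", the one-weight "head", the names `forward` and `sgd_step`, and all numbers are assumptions made up for illustration. The feature-based case is also simplified to "frozen encoder plus trainable head".

```python
# Toy "pretrained" model: a one-weight encoder followed by a one-weight task head.
# All values are hypothetical and chosen only to make the arithmetic easy to follow.

def forward(params, x):
    feature = params["encoder_w"] * x      # pre-trained representation of the input
    return params["head_w"] * feature      # task-specific output


def sgd_step(params, x, y, trainable, lr=0.1):
    """One gradient step on squared error, updating only the keys in `trainable`."""
    feature = params["encoder_w"] * x
    error = forward(params, x) - y
    grads = {
        "encoder_w": 2 * error * params["head_w"] * x,
        "head_w": 2 * error * feature,
    }
    return {
        name: (value - lr * grads[name]) if name in trainable else value
        for name, value in params.items()
    }


pretrained = {"encoder_w": 0.5, "head_w": 1.0}

# Feature-based: the encoder is frozen and only the task head learns.
feature_based = sgd_step(pretrained, x=1.0, y=2.0, trainable={"head_w"})

# Fine-tuning: the pre-trained weights are the starting point and all of them are updated.
fine_tuned = sgd_step(pretrained, x=1.0, y=2.0, trainable={"encoder_w", "head_w"})

print("feature-based:", feature_based)
print("fine-tuned:   ", fine_tuned)
```

After one step, `encoder_w` is unchanged in the feature-based result but has moved in the fine-tuned result, while `head_w` is updated in both.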

Some fine-tuning-based language models have the restriction that they can only learn in one direction. For example, OpenAI GPT uses a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017).
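The left-to-right restriction can be pictured as an attention mask. The sketch below is only an illustration: the helper names `causal_mask` and `bidirectional_mask` and the example sentence are hypothetical, and a real Transformer applies such a mask to attention scores rather than to lists of words.

```python
def causal_mask(n):
    """Left-to-right mask: token i may attend only to tokens 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional mask: every token may attend to every token."""
    return [[1] * n for _ in range(n)]


tokens = ["the", "cat", "sat", "down"]

for name, mask in [
    ("left-to-right", causal_mask(len(tokens))),
    ("bidirectional", bidirectional_mask(len(tokens))),
]:
    print(name)
    for token, row in zip(tokens, mask):
        visible = [t for t, allowed in zip(tokens, row) if allowed]
        print(f"  {token!r} attends to {visible}")
```

With the left-to-right mask, "cat" sees only "the" and itself. With the bidirectional mask it also sees "sat" and "down", which is the context a unidirectional model cannot use.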

This restriction can be very harmful in tasks such as question answering, where incorporating context from both directions is important for identifying the crucial part of the text.

The paper improves on fine-tuning-based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers.


From this section on, I introduce the architecture of BERT.
BERT consists of two phases: pre-training and fine-tuning.

pre-training: the model is trained on unlabeled data over different pre-training tasks.
fine-tuning: the model is initialised with the pre-trained parameters, and all of the parameters are retrained using labeled data from the downstream tasks.

A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture.
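The unified architecture can be sketched as one set of pre-trained parameters reused by every downstream model, with only a small output layer added per task. This is a rough illustration under stated assumptions: the parameter names, the sizes, the two example tasks, and the helper `build_downstream_model` are all hypothetical and do not reflect BERT's real dimensions.

```python
import copy

# Hypothetical pre-trained encoder parameters (sizes are illustrative only).
pretrained_encoder = {
    "embeddings": [0.1] * 100,
    "layer_1": [0.2] * 200,
    "layer_2": [0.3] * 200,
}


def build_downstream_model(encoder, num_outputs, hidden_size=4):
    """Copy the pre-trained encoder and attach a freshly initialised output layer."""
    return {
        "encoder": copy.deepcopy(encoder),                     # initialised from pre-training
        "output_layer": [0.0] * (hidden_size * num_outputs),   # the only new parameters
    }


def count_parameters(params):
    return sum(len(values) for values in params.values())


sentiment_model = build_downstream_model(pretrained_encoder, num_outputs=2)
topic_model = build_downstream_model(pretrained_encoder, num_outputs=5)

for name, model in [("sentiment", sentiment_model), ("topic", topic_model)]:
    shared = count_parameters(model["encoder"])
    new = len(model["output_layer"])
    print(f"{name}: {shared} pre-trained parameters, {new} new parameters")
```

Each downstream model starts from its own copy of the same encoder, so fine-tuning one task does not change the other. In this toy setup the task-specific part is small compared with the shared part, which is the sense in which the architectures differ only minimally.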

