【Paper】BERT part1 ~ Language Models are Few-Shot Learners ~


The original paper is here.

This article is an explanation of BERT (Bidirectional Encoder Representations from Transformers).

1. Introduction

Current LLM approaches can be divided into two patterns by how they handle downstream tasks.
One is feature-based: the pretrained representations are used as fixed features in a task-specific model.
The other is fine-tuning: the final part of the pretrained model is modified (for example, only the output linear layer of a classification model is replaced to fit a different classification task), the new model is initialised with the pretrained parameters, and the whole model is retrained.

Some fine-tuning based LLMs have the restriction that they can only learn in one direction. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017).
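The left-to-right restriction corresponds to a causal attention mask. A minimal sketch of the difference from bidirectional attention (the variable names and sequence length are my own assumptions):

```python
import torch

seq_len = 5

# GPT-style left-to-right mask: position i may attend only to j <= i,
# i.e. only the lower triangle of the attention matrix is allowed.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# BERT-style bidirectional attention: no such restriction, every token
# may attend to every other token in the sequence.
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

# Under the causal mask, token 0 can see only itself:
print(causal_mask[0].tolist())  # → [True, False, False, False, False]
```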

This can be very harmful in tasks such as question answering, where incorporating information from both directions is important for recognising which parts are crucial.

In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers.


From this section, I introduce the architecture of BERT.
BERT consists of two phases: pre-training and fine-tuning.

pre-training: models are trained on unlabeled data over different pre-training tasks.
fine-tuning: models are initialised with the pre-trained parameters, and all of the parameters are retrained on labeled data from downstream tasks.
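The hand-off between the two phases can be sketched as copying parameters from one model to another (a toy sketch; the layer sizes and the 3-class head are my own assumptions, not BERT's):

```python
import torch
import torch.nn as nn

# Phase 1 (pre-training): `encoder` would be trained on unlabeled text.
encoder = nn.Linear(16, 32)  # toy stand-in for a Transformer encoder

# Phase 2 (fine-tuning): a downstream model is initialised with the
# pre-trained parameters, then ALL of its parameters are retrained on
# labeled data from the downstream task.
downstream_encoder = nn.Linear(16, 32)
downstream_encoder.load_state_dict(encoder.state_dict())
task_head = nn.Linear(32, 3)  # fresh task-specific head (e.g. 3 classes)

# The downstream encoder starts from exactly the pre-trained weights:
assert torch.equal(downstream_encoder.weight, encoder.weight)
```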

A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture.

Unfortunately, that's all for today. I'll write the continuation next time.


[1] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei, "Language Models are Few-Shot Learners", arXiv, 2020.