【Transformer】Difference between Post-LN and Pre-LN
The difference between the two is the position at which layer normalization is applied in the Transformer architecture.
・Comparison
Quote: [1]
1. Post-LN
Overview
Layer normalization is applied after the residual connection: the output of each sub-layer (self-attention or the feed-forward network) is added to that sub-layer's input, and the sum is then normalized.
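As a concrete illustration, here is a minimal sketch of a Post-LN encoder block in PyTorch. The class and parameter names (PostLNBlock, d_model, n_heads, d_ff) are illustrative choices for this sketch, not taken from [1].

```python
# Minimal sketch of a Post-LN Transformer encoder block (illustrative names).
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Post-LN: add the sub-layer output to its input first,
        # then normalize the sum.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ffn(x))
        return x
```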
Use Case
Post-LN is the original design of the Transformer and is used in models such as the original BERT and GPT. It tends to work well when the model is not very deep (fewer layers) and training is already stable; in practice it usually relies on a careful learning-rate warmup stage [1].
2. Pre-LN
Overview
Layer normalization is applied inside the residual branch, before each sub-layer: the input is normalized before it enters self-attention or the feed-forward network, and the un-normalized input is carried by the residual connection.
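For comparison, here is the same sketch with the normalization moved inside the residual branch (Pre-LN); only the forward pass changes. The names are the same illustrative ones as in the Post-LN sketch above.

```python
# Minimal sketch of a Pre-LN Transformer encoder block (illustrative names).
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-LN: normalize the input before each sub-layer;
        # the un-normalized x is carried by the residual connection.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))
        return x
```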
Features
- Stability and Training Depth: Pre-LN supports deeper networks because it mitigates the vanishing-gradient problem better than Post-LN (see the gradient-norm sketch at the end of this section).
- Learning Rate Sensitivity: Pre-LN models are less sensitive to the choice of learning rate, which allows more aggressive training schedules (e.g., shortening or removing warmup [1]).
Use Case
Pre-LN tends to be more effective for deeper models or when training with higher learning rates. It helps stabilize the training process and is often preferred in newer architectures, such as GPT-3 and later Transformer variants.
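As a rough, self-contained way to probe the stability claim above, the snippet below stacks several torch.nn.TransformerEncoderLayer modules, which expose both variants through the norm_first flag (False gives Post-LN, True gives Pre-LN; this assumes a PyTorch version recent enough to have the flag), and prints the gradient norm at the bottom layer after one backward pass. The depth, sizes, and random input are arbitrary, so the exact numbers will vary; it is only meant to give a quick feel for how the two placements propagate gradients at initialization.

```python
# Rough comparison of bottom-layer gradient norms for Post-LN vs. Pre-LN
# stacks at initialization. Sizes and depth are arbitrary illustrative values.
import torch
import torch.nn as nn

def bottom_layer_grad_norm(norm_first: bool, depth: int = 12) -> float:
    torch.manual_seed(0)
    layers = nn.Sequential(*[
        nn.TransformerEncoderLayer(
            d_model=64, nhead=4, dim_feedforward=256, dropout=0.0,
            batch_first=True, norm_first=norm_first,  # False: Post-LN, True: Pre-LN
        )
        for _ in range(depth)
    ])
    x = torch.randn(2, 16, 64)      # (batch, sequence, d_model)
    layers(x).sum().backward()      # dummy scalar loss
    # L2 norm of all gradients in the first (bottom) layer of the stack.
    sq = sum(float(p.grad.norm()) ** 2
             for p in layers[0].parameters() if p.grad is not None)
    return sq ** 0.5

print("Post-LN bottom-layer grad norm:", bottom_layer_grad_norm(norm_first=False))
print("Pre-LN  bottom-layer grad norm:", bottom_layer_grad_norm(norm_first=True))
```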
Reference
[1] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu, "On Layer Normalization in the Transformer Architecture", arXiv:2002.04745, 2020.