Survey on Positional Encoding
Survey questions
- Why trigonometric functions? (Why is a periodic property needed?)
- Why can positional information be represented by simple addition?
Original paper: Attention Is All You Need
3.5 Positional Encoding
Since our model contains no recurrence and no convolution, in order for the model to make use of the
order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the
bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_{model}
as the embeddings, so that the two can be summed. There are many choices of positional encodings,
learned and fixed [9].
In this work, we use sine and cosine functions of different frequencies:
PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{model}})
PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{model}})
where pos is the position and i is the dimension. That is, each dimension of the positional encoding
corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We
chose this function because we hypothesized it would allow the model to easily learn to attend by
relative positions, since for any fixed offset k, PE_{pos+k} can be represented as a linear function of
PE_{pos}.
We also experimented with using learned positional embeddings [9] instead, and found that the two
versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version
because it may allow the model to extrapolate to sequence lengths longer than the ones encountered
during training.
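As a side note on the claim above that PE_{pos+k} can be written as a linear function of PE_{pos}: writing \omega_i = 1/10000^{2i/d_{model}}, each (sin, cos) pair is carried to its value at offset k by a rotation that depends only on k and i, never on pos. A short check using the angle-addition formulas:

\[
\begin{pmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i pos) \\ \cos(\omega_i pos) \end{pmatrix}
\]

Because this matrix does not depend on pos, attending to a fixed relative offset reduces to a fixed linear map, which seems to be the point of choosing periodic functions (first survey question above).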
Summary
- Self-Attention has no convolution or recurrence and does not process the sequence in order, so absolute or relative positional information must be injected
- To do this, positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks (see the sketch after this list)
- Replacing them with the learned position embeddings used in prior work [9] achieved nearly identical accuracy
- The sinusoidal version was still chosen because it may extrapolate to sequence lengths longer than those seen during training
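A minimal NumPy sketch of the sinusoidal encoding from §3.5 and of the addition to the token embeddings; the function name, the toy dimensions, and the random stand-in embeddings are my own illustration, not taken from the paper (d_model is assumed even):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(max_len)[:, None]                 # shape (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # 2i for i = 0 .. d_model/2 - 1
    angles = positions / np.power(10000.0, dims / d_model)   # shape (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions -> sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions  -> cosine
    return pe

# Toy usage: the encoding is simply summed with the token embeddings.
seq_len, d_model = 10, 512
token_embeddings = np.random.randn(seq_len, d_model)   # stand-in for learned word embeddings
x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```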
Prior work [9] cited in the original paper: Convolutional Sequence to Sequence Learning
Position Embeddings are introduced in this paper
3.1. Position Embeddings
First, we embed input elements x = (x_1, . . . , x_m) in distributional space as w = (w_1, . . . , w_m), where w_j ∈ R^f is a column in an embedding matrix D ∈ R^{V×f}.
We also equip our model with a sense of order by embedding the absolute position of input elements p = (p_1, . . . , p_m) where p_j ∈ R^f.
Both are combined to obtain input element representations e = (w_1 + p_1, . . . , w_m + p_m).
We proceed similarly for output elements that were already generated by the decoder network to yield output element representations that are being fed back into the decoder network g = (g_1, . . . , g_n).
Position embeddings are useful in our architecture since they give our model a sense of which portion of the sequence in the input or output it is currently dealing with (§5.4).
Summary
- The absolute-position embedding p is added to the distributed representation obtained from the embedding matrix \mathcal{D} (see the sketch below)
- This approach reportedly works well for their architecture
Presumably this is written with the implication that Positional Encoding does not have to encode absolute positions only; relative positions would also work
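For comparison, a minimal sketch of the learned variant from §3.1: both the word embeddings and the absolute-position embeddings are lookup tables, and the two lookups are summed element-wise. The sizes and variable names below are illustrative assumptions, and the tables are left randomly initialized instead of being learned:

```python
import numpy as np

rng = np.random.default_rng(0)
V, f, max_positions = 32000, 512, 1024     # vocabulary size, embedding dim, position table size

# Two lookup tables that would normally be learned jointly with the model.
word_embed = rng.normal(size=(V, f))             # D in the paper: one vector per vocabulary item
pos_embed = rng.normal(size=(max_positions, f))  # one vector per absolute position

token_ids = np.array([5, 42, 7, 7, 901])         # toy input sentence x = (x_1, ..., x_m)
positions = np.arange(len(token_ids))            # absolute positions p (0-indexed here)

# e = (w_1 + p_1, ..., w_m + p_m): element-wise sum of the two lookups.
e = word_embed[token_ids] + pos_embed[positions]  # shape (m, f)
```

The fixed size of the position table (max_positions here) is what imposes the "restriction on the maximum sentence length" mentioned in §5.4 below.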
5.4. Position Embeddings (excerpt)
We start with an experiment that removes the position embeddings from the encoder and decoder (§3.1).
These embeddings allow our model to identify which portion of the source and target sequence it is dealing with but also impose a restriction on the maximum sentence length.
Table 4 shows that position embeddings are helpful but that our model still performs well without them.
Removing the source position embeddings results in a larger accuracy decrease than target position embeddings.
However, removing both source and target positions decreases accuracy only by 0.5 BLEU.
We had assumed that the model would not be able to calibrate the length of the output sequences very well without explicit position information, however, the output lengths of models without position embeddings closely matches models with position information.
This indicates that the models can learn relative position information within the contexts visible to the encoder and decoder networks which can observe up to 27 and 25 words respectively.
Recurrent models typically do not use explicit position embeddings since they can learn where they are in the sequence through the recurrent hidden state computation.
In our setting, the use of position embeddings requires only a simple addition to the input word embeddings which is a negligible overhead.
Summary
- They ran an ablation that removes the position embeddings
- Position embeddings let the model identify which portion of the source and target sequence it is handling, but they also impose a restriction on the maximum sentence length
- Contrary to expectations, removing both source and target position embeddings lowered BLEU by only 0.5
- Output sequence lengths with and without position embeddings were nearly identical
- This indicates that, within the contexts visible to the encoder and decoder (up to 27 and 25 words respectively), the model can learn relative position information on its own
- RNNs play this role through their recurrent hidden-state computation; here, position embeddings only require a simple addition to the word embeddings, which is a negligible overhead
My impression from reading this: the results did not quite match the authors' expectations?
Both papers use WMT datasets, so there does not seem to be a difference in the sequence lengths seen during training.