
A Slow Walkthrough of the Transformer, the Model Sweeping the AI World (Day 6): Why Self-Attention

Published 2021/05/09

This is day 6 of a series explaining the Transformer, the model sweeping the AI world.

The PDF of the Attention Is All You Need paper is here

Day 1: Abstract
Day 2: Introduction / Background
Day 3: Model Architecture 1
Day 4: Model Architecture 2
Day 5: Model Architecture 3
Day 6: Why Self-Attention
Day 7: Training
Day 8: Results / Conclusion
Day 9: Source Code

Links to the earlier articles in the series are at the bottom.
Let's get right into it.

4 Why Self-Attention

In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations $(x_1, ..., x_n)$ to another sequence of equal length $(z_1, ..., z_n)$, with $x_i, z_i \in \mathbb{R}^d$, such as a hidden layer in a typical sequence transduction encoder or decoder.

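This section compares various aspects of Self-Attention layers with RNN and CNN layers. As a reminder, RNN and CNN layers are commonly used to map one variable-length sequence $(x_1, ..., x_n)$ to another sequence of the same length $(z_1, ..., z_n)$, for example as a hidden layer inside a typical sequence transduction encoder or decoder. Here $x_i, z_i \in \mathbb{R}^d$.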

Motivating our use of self-attention we consider three desiderata.

The motivation for using Self-Attention comes down to three desiderata, that is, three things the authors want from a layer.

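This was my first time seeing the word desiderata. It is the plural of desideratum, which literally means "an urgent need, something whose lack is keenly felt". As an aside, I find it interesting that English has single words like CrystalClear that are redundant in a rather Japanese way and yet very expressive.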

One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.

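The first is the total computational complexity per layer. The second is the amount of computation that can be parallelized, measured by the minimum number of sequential operations required: the fewer steps that must run one after another, the more of the work can run in parallel.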

The third is the path length between long-range dependencies in the network.

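The third is the path length between long-range dependencies in the network, that is, how far a signal has to travel to connect positions that are far apart.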

Learning long-range dependencies is a key challenge in many sequence transduction tasks.

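Learning dependencies between distant positions (for example, words far apart in a sentence) is a key challenge in many sequence transduction tasks such as translation.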

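In human terms, we can recall something from much earlier and tie it to what is happening now. Which means the "tendon" (the callback gag in Japanese comedy) was an advanced technique all along.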

One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network.

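If we think of this as long-term memory, one key factor in how well such dependencies can be learned is the length of the paths that forward and backward signals have to traverse in the network.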

The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12].

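The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies. This has also been shown in another paper.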

https://www.aclweb.org/anthology/D09-1087.pdf: published in 2009 by the University of Maryland and others

Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.

So the paper also compares the maximum path length between any two input and output positions in networks built from the different layer types.

As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations.

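As Table 1 shows, a Self-Attention layer connects all positions with a constant number, $O(1)$, of sequentially executed operations, whereas an RNN layer requires $O(n)$ sequential operations.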

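Here $n$ is the sequence length, in other words the number of words (tokens) in the sentence, in other words how long the sentence is. A constant number means that, in a Self-Attention layer, the number of sequential steps and the distance a signal travels between any two positions do not depend on the sentence length. Note that the total computation per layer still grows with $n$; only the part that must run sequentially stays constant.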
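To make the difference in sequential operations concrete, here is a minimal sketch in pure Python. The function names, the scalar toy update rule, and the choice of query = key = value are my own assumptions for illustration and are not taken from the paper. The point is only that each RNN step needs the previous hidden state, while each Self-Attention output depends on the inputs alone.

```python
import math


def rnn_forward(xs, w=0.5, u=0.5):
    """Toy scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t).

    Each step needs the previous hidden state, so the loop cannot be
    parallelized across positions. Returns (hidden states, sequential steps).
    """
    h = 0.0
    hs = []
    steps = 0
    for x in xs:
        h = math.tanh(w * h + u * x)  # depends on the h from the previous step
        hs.append(h)
        steps += 1
    return hs, steps


def self_attention_forward(xs):
    """Toy scalar self-attention with query = key = value = x_i (an assumption).

    Every output z_i depends only on the inputs, never on another z_j,
    so all n outputs could be computed at the same time. We therefore
    count one sequential step regardless of n.
    """
    zs = []
    for q in xs:  # iterations are independent of each other
        scores = [q * k for k in xs]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        zs.append(sum(wt * v for wt, v in zip(weights, xs)) / total)
    return zs, 1


xs = [0.1, -0.2, 0.3, 0.4]
_, rnn_steps = rnn_forward(xs)
_, attn_steps = self_attention_forward(xs)
print(f"n={len(xs)}: RNN needs {rnn_steps} sequential steps, self-attention {attn_steps}")
```

The Python loop inside `self_attention_forward` is only there for readability; on real hardware those iterations would be one batched matrix operation.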

In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d, which is most often the case with sentence representations used by state-of-the-art models in machine translations, such as word-piece [38] and byte-pair [31] representations.

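As Table 1 also shows, in terms of computational complexity Self-Attention layers are faster than RNN layers under one condition. Self-Attention costs $O(n^2 \cdot d)$ per layer and an RNN layer costs $O(n \cdot d^2)$, so the condition is that the sequence length $n$ is smaller than the representation dimensionality $d$. This holds in most cases, including the sentence representations used by state-of-the-art machine translation models, such as word-piece [38] and byte-pair [31] representations.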
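The condition $n < d$ can be checked with a few lines of arithmetic. This is a rough sketch that drops all constant factors; the function names and the example sequence lengths are my own choices, and $d = 512$ is the model dimension of the base Transformer in the paper.

```python
def self_attention_cost(n, d):
    """Per-layer cost of self-attention, O(n^2 * d), constants dropped."""
    return n * n * d


def recurrent_cost(n, d):
    """Per-layer cost of a recurrent layer, O(n * d^2), constants dropped."""
    return n * d * d


d = 512  # d_model of the base Transformer
for n in (50, 512, 2000):  # hypothetical sequence lengths
    sa, rnn = self_attention_cost(n, d), recurrent_cost(n, d)
    cheaper = "self-attention" if sa < rnn else "recurrent (or tie)"
    print(f"n={n:5d}  self-attn={sa:>13,}  recurrent={rnn:>13,}  cheaper: {cheaper}")
```

The two costs cross exactly at $n = d$, which is why typical sentence lengths of a few dozen tokens fall comfortably on the Self-Attention side.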

To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position.

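To speed up computation on very long sequences, Self-Attention could be restricted (the Self-Attention (restricted) row of Table 1): each output position would consider only a neighborhood of size $r$ in the input sequence, centered on that position.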

This would increase the maximum path length to O(n/r). We plan to investigate this approach further in future work.

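In that case, however, the maximum path length grows to $O(n/r)$. This is the fourth row of Table 1. The authors plan to investigate this approach further in future work.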

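In short, the restricted version of Self-Attention is a trade-off: it lowers the computational cost, but the paths between distant positions get longer. The paper only proposes the idea and does not report results for it.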
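The trade-off of the restricted variant can be sketched as follows. The helper names and the window convention (`r // 2` positions on each side, so an odd `r` gives a window of exactly `r`) are my own assumptions; the paper only states the orders $O(r \cdot n \cdot d)$ and $O(n/r)$.

```python
def neighborhood(i, n, r):
    """Positions that output position i may attend to: a window centered on i.

    With odd r the window holds exactly r positions (fewer near the edges).
    """
    half = r // 2
    return range(max(0, i - half), min(n, i + half + 1))


def max_path_length(n, r):
    """Number of stacked restricted layers before the last position can
    receive information from the first one."""
    if r < 2:
        raise ValueError("r must be at least 2, otherwise positions never connect")
    reach, layers = n - 1, 0  # leftmost position visible from position n - 1
    while reach > 0:
        reach = neighborhood(reach, n, r)[0]  # one more layer extends the reach
        layers += 1
    return layers


def restricted_cost(n, r, d):
    """Per-layer cost O(r * n * d), constants dropped."""
    return r * n * d


n, d = 1000, 512  # hypothetical sequence length, base-model dimension
for r in (11, 101, 501):
    print(f"r={r:4d}  cost={restricted_cost(n, r, d):>12,}  max path={max_path_length(n, r)}")
```

A smaller `r` makes each layer cheaper but requires more layers before distant positions can interact, which is the $O(n/r)$ growth in path length.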

A single convolutional layer with kernel width $k < n$ does not connect all pairs of input and output positions.

A single CNN layer whose kernel (the convolution filter) has width $k < n$ does not connect all pairs of input and output positions.

Doing so requires a stack of $O(n/k)$ convolutional layers in the case of contiguous kernels, or $O(\log_k(n))$ in the case of dilated convolutions [18], increasing the length of the longest paths between any two positions in the network.

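Connecting them requires a stack of $O(n/k)$ CNN layers with contiguous kernels, or $O(\log_k(n))$ layers with dilated convolutions. The downside is that stacking layers lengthens the longest path between any two positions in the network.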

That is the rightmost column of the third row of the table.
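As a rough check of these two orders, the sketch below counts how many layers are needed until the receptive field of one output spans all $n$ inputs. Both function names are hypothetical, and the dilation schedule (dilation `k**layer`, so the field grows k-fold per layer) is an assumption on my part; the paper only gives the asymptotic orders.

```python
def contiguous_layers(n, k):
    """Stacked ordinary convolutions: each layer widens the receptive
    field by k - 1 positions."""
    if k < 2:
        raise ValueError("kernel width k must be at least 2")
    field, layers = 1, 0
    while field < n:
        field += k - 1
        layers += 1
    return layers


def dilated_layers(n, k):
    """Dilated convolutions with dilation k**layer (an assumed schedule):
    the receptive field grows k-fold per layer."""
    if k < 2:
        raise ValueError("kernel width k must be at least 2")
    field, layers = 1, 0
    while field < n:
        field *= k
        layers += 1
    return layers


n = 1024  # hypothetical sequence length
for k in (2, 3, 9):
    print(f"k={k}: contiguous={contiguous_layers(n, k)} layers, dilated={dilated_layers(n, k)} layers")
```

The contiguous stack grows linearly with $n$, while the dilated stack grows only logarithmically, but in both cases the path length is no longer the $O(1)$ of Self-Attention.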

Convolutional layers are generally more expensive than recurrent layers, by a factor of k.

CNN layers are generally more expensive than RNN layers, by a factor of the kernel width $k$.

Separable convolutions [6], however, decrease the complexity considerably, to $O(k \cdot n \cdot d + n \cdot d^2)$ .

The Separable Convolutions layer introduced in the paper below, however, reduces this complexity considerably, to $O(k \cdot n \cdot d + n \cdot d^2)$.

Deep learning with depthwise separable convolutions: published in 2017 by Google

This is not listed in the table, but it is smaller than the $O(k \cdot n \cdot d^2)$ of an ordinary convolutional layer.

Even with $k = n$ , however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.

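Incidentally, even when the kernel width $k$ equals the sequence length $n$, the complexity of a Separable Convolution is of the same order as the combination of a Self-Attention layer and a point-wise feed-forward layer, which is the approach this paper takes.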
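Substituting $k = n$ into the separable-convolution complexity makes this explicit. The only assumption here is that the inner dimension of the feed-forward layer is a constant multiple of $d$ (in the base model it is $2048 = 4 \times 512$), so that its cost is $O(n \cdot d^2)$:

$$
O(k \cdot n \cdot d + n \cdot d^2)\,\Big|_{k = n} = O(n^2 \cdot d + n \cdot d^2) = \underbrace{O(n^2 \cdot d)}_{\text{self-attention}} + \underbrace{O(n \cdot d^2)}_{\text{point-wise feed-forward}}
$$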

As side benefit, self-attention could yield more interpretable models.

As a side benefit, Self-Attention could also yield more interpretable models, that is, models whose behavior is easier for humans to inspect.

We inspect attention distributions from our models and present and discuss examples in the appendix.

On this point, the authors inspect the attention distributions of their models and present and discuss examples in the appendix.

Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.

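Not only do the individual attention heads clearly learn to perform different tasks, many of them also appear to exhibit behavior related to the syntactic and semantic structure of the sentences.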

Wrap-up

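That's it for day 6 of the series explaining the Transformer, the model sweeping the AI world. Much of it was a review of what earlier sections had already covered. Next time is Training.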

Feedback, requests, and corrections are welcome in the comments on this article, or via Twitter replies or DMs!

I also spend quite a lot of time writing these, so tips and support are much appreciated!

Related articles in the series are here

[Added May 2023]
I also develop and run "Q", a ChatGPT service for Slack.
Please give it a try.
https://q-bot.suchica.com/
