A Slow Walkthrough of the "Transformer" Taking the AI World by Storm (Day 5): Model Architecture, Part 3

Published 2021/05/08

This is day 5 of the series explaining the "Transformer" that is taking the AI world by storm.

The paper being read is Attention Is All You Need (PDF).

Day 1: Abstract
Day 2: Introduction / Background
Day 3: Model Architecture 1
Day 4: Model Architecture 2
Day 5: Model Architecture 3
Day 6: Why Self-Attention
Day 7: Training
Day 8: Results / Conclusion
Day 9: Source Code

Links to the earlier articles in the series are at the bottom.
Let's get straight into it.

3.2.3 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:

In summary, the Transformer uses Multi-Head Attention in three ways.

• In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.

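The first is the attention layer that connects the Encoder and the Decoder. The query $Q$ comes from the output of the previous Decoder layer, while the memory keys $K$ and values $V$ come from the output of the Encoder.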

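As shown in an earlier installment of this series, the Decoder (like the Encoder) is a stack of $N = 6$ identical layers, which is the "N×" in the figure. Picture the gray block repeated six times vertically: the input to the $i$-th Decoder layer is the output of the $(i-1)$-th Decoder layer.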

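That much is fine, but why is it the query rather than the key or the value? It is probably best to take this as the paper's definition. One way to read it is that the Decoder position is the one asking the question (the query), and the Encoder outputs are what gets looked up (the keys and values).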

This allows every position in the decoder to attend over all positions in the input sequence.

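In other words, whatever position the word currently being translated occupies in the Decoder, it can attend to every position of the input sentence, that is, the sentence before translation.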

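Put differently, where a conventional translation model can in practice only look back a short way (roughly "what was the previous word again?"), this model remembers every word of the sentence being translated, however far back it appeared, and can go back and refer to it. What makes this possible is that each Decoder position issues its own query $Q$ against all Encoder positions.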

This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].

This mimics the typical Encoder-Decoder Attention mechanisms found in Sequence to Sequence (often written Seq2Seq) models such as those of the cited papers [38, 2, 9].
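As a rough illustration of where $Q$, $K$ and $V$ come from in this first case, here is a minimal sketch in plain Python. The toy vectors and the names `encoder_output` and `decoder_states` are assumptions made up for this example, and the learned projection matrices of the real model are left out for brevity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for small lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # One score per key: dot(q, k) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy data: 3 source positions (Encoder output), 2 target positions (Decoder states).
encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[1.0, 0.0], [0.0, 2.0]]

# Queries come from the Decoder; keys and values both come from the Encoder output.
context, weights = attention(decoder_states, encoder_output, encoder_output)
```

Each row of `weights` has one entry per source position, so every Decoder position gets a weight for every input position, which is the point the paper makes here.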

• The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder.

The second is the Self-Attention layer inside the Encoder. In this Self-Attention layer, the keys $K$, values $V$ and queries $Q$ all come from the same place, which in this case is the output of the previous Encoder layer.

The Encoder is the left side of the figure. Like the Decoder, it is a stack of $N$ layers, so apart from the first layer, the $i$-th Encoder layer takes its input from the previous Encoder layer.

Each position in the encoder can attend to all positions in the previous layer of the encoder.

So every word position in the sequence inside the $i$-th Encoder layer can attend to every word position in the sequence of the previous, $(i-1)$-th, Encoder layer.

• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position.

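The third, similarly, is the Self-Attention layer inside the Decoder. In the figure, it is the bottom Attention layer of the three sub-layers inside the gray block on the right, Decoder, side. As in the second case, each position can look back at earlier words. Because the layer is masked, it cannot attend to words that come after the one currently being translated.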

We need to prevent leftward information flow in the decoder to preserve the auto-regressive property.

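To preserve the auto-regressive property (each output token is predicted only from the tokens generated before it), leftward information flow in the Decoder has to be prevented.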

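This is the same point in different words. When translating a word, the model may refer only to that word and to the words already translated. In the paper's phrasing, information from a newly translated word must not influence words that have already been translated, that is, information must be stopped from flowing leftward.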

We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2.

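To achieve this, a mask is applied to the input of the Softmax inside Scaled Dot-Product Attention. For the query at position $i$, the scores for positions $i+1$ onward are replaced with $- \infty$, so that after the Softmax those connections receive weight zero, which is equivalent to not being connected at all.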
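The masking step can be sketched as follows. This is a minimal plain-Python illustration, not the paper's implementation: the helper name `masked_softmax_row` and the score values are made up for the example, and positions are 0-indexed here.

```python
import math

def masked_softmax_row(scores, i):
    """Softmax over `scores` for the query at position i, masking positions > i."""
    masked = [s if j <= i else -math.inf for j, s in enumerate(scores)]
    m = max(masked)  # position i itself is never masked, so m is finite
    exps = [math.exp(s - m) for s in masked]  # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores for a 4-token target sequence.
scores = [
    [0.5, 1.0, 0.2, 0.1],
    [0.3, 0.3, 0.9, 0.4],
    [1.2, 0.1, 0.1, 0.7],
    [0.0, 0.0, 0.0, 0.0],
]
weights = [masked_softmax_row(row, i) for i, row in enumerate(scores)]
```

Every entry above the diagonal of `weights` comes out as exactly zero, so a position never receives information from a later position, while each row still sums to one.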

3.3 Position-wise Feed-Forward Networks

Next comes the explanation of Position-wise Feed-Forward Networks.

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically.

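In addition to the Attention sub-layers, each layer of the Encoder and Decoder in the figure contains a Feed-Forward Network (FFN) sub-layer, which is applied to each position separately and identically. These are the light-blue blocks.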

This consists of two linear transformations with a ReLU activation in between.

This sub-layer consists of two linear transformations with a ReLU activation function in between. As a formula:

$\mathrm{FFN}(x) = \mathrm{max}(0, xW_1 + b_1)W_2 + b_2$

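Drawn as a diagram, it looks like this.

Writing $y = xW_1 + b_1$ for the output of the first linear transformation and $z = \mathrm{max}(0, y)$ for the output of the ReLU, the formula is derived like this.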

\begin{aligned} \mathrm{FFN}(x) &= zW_2+b_2 \\ &= \mathrm{max}(0, y)W_2+b_2 \\ &= \mathrm{max}(0, xW_1 + b_1)W_2 + b_2 \end{aligned}
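The same three steps can be written out as a small plain-Python sketch. The weight values and the toy sizes ($d_{model} = 2$, inner size $3$) are assumptions chosen only so the arithmetic is easy to follow; the intermediate names `y` and `z` match the derivation above.

```python
def linear(x, W, b):
    """Compute x W + b for a vector x, a matrix W (one row per input), and a bias b."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    y = linear(x, W1, b1)          # y = x W1 + b1
    z = [max(0.0, v) for v in y]   # z = max(0, y), i.e. ReLU
    return linear(z, W2, b2)       # FFN(x) = z W2 + b2

# Hypothetical toy parameters: 2 -> 3 -> 2 (the paper uses 512 -> 2048 -> 512).
W1 = [[1.0, -1.0, 0.5],
      [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.1, -0.1]

# "Position-wise": the same ffn, with the same parameters, is applied to each position.
sequence = [[1.0, 2.0], [3.0, -1.0]]
outputs = [ffn(x, W1, b1, W2, b2) for x in sequence]
```

For the first position, `y` is `[1.0, 3.0, -1.5]`, the ReLU zeroes the negative entry, and the second linear transformation maps the result back to two dimensions.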

While the linear transformations are the same across different positions, they use different parameters from layer to layer.

The same linear transformations are applied at every position of the input, but the weights $W$ and biases $b$ they use differ from layer to layer.

Another way of describing this is as two convolutions with kernel size 1.

Put another way, these linear transformations can also be viewed as two Convolutions with kernel size $1$.

A kernel here means a filter, namely the filter used in the convolution operation of a Convolutional Neural Network (CNN). Convolutions are by now basic material in this field, and there are many introductory articles on them.

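Typically, an image or a sentence is converted into vectors and handled as a single matrix, and a filter is slid over it, computing a weighted sum of the entries under the filter at each location (often reducing the size or the number of dimensions in the process). This is called convolution. With a filter of size $1$, each output position depends only on the same input position, so the position-wise linear transformation can be regarded as a convolution with a kernel of size $1$.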
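The equivalence can be checked on a toy example. The sketch below is plain Python with a hand-written `conv1d` helper and made-up weights (both are assumptions for illustration); it compares applying one linear transformation to each position against a 1-D convolution whose kernel has a single tap.

```python
def conv1d(seq, kernel, bias):
    """1-D convolution (no padding, stride 1) over a sequence of vectors.

    seq:    list of positions, each a list of input-channel values
    kernel: kernel[t][i][j] is the weight for tap t, input channel i, output channel j
    """
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        row = []
        for j in range(len(bias)):
            acc = bias[j]
            for t in range(k):
                for i, x in enumerate(seq[start + t]):
                    acc += x * kernel[t][i][j]
            row.append(acc)
        out.append(row)
    return out

def linear(x, W, b):
    """Compute x W + b for a single position."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

# Hypothetical weights mapping 2 channels to 3 channels.
W = [[1.0, 2.0, 0.0],
     [0.5, -1.0, 3.0]]
b = [0.0, 1.0, -1.0]
seq = [[1.0, 2.0], [0.0, 1.0], [3.0, -2.0]]

per_position = [linear(x, W, b) for x in seq]
as_conv = conv1d(seq, [W], b)  # kernel size 1: a single tap that holds W
```

Both computations give the same values, and the output keeps the same number of positions as the input because a size-1 kernel never mixes neighbouring positions.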

The dimensionality of input and output is $d_{model} = 512$ , and the inner-layer has dimensionality $d_{ff} = 2048$ .

The input and output dimensionality of this FFN is $d_{model} = 512$, and the dimensionality of the layer inside the FFN is $d_{ff} = 2048$. The $ff$ stands for Feed-Forward.

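The paper does not spell out why the dimensionality is raised here, but a plausible reason is that a 2048-dimensional inner layer can represent features that 512 dimensions cannot fully express.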

3.4 Embeddings and Softmax

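This section explains the Embedding and Softmax layers. They have not been touched on so far, but this is where they are covered. In the figure, they are the two pink layers at the very bottom and the light-green layer at the upper right.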

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{model}$ .

As in other sequence transduction (translation) models, learned Embedding layers are used to convert the input tokens and output tokens into vectors of dimension $d_{model}$.

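The word "token" is usually left untranslated. Here it can be thought of as the character string of each word in the sentence (more precisely, the paper splits text into sub-word units). These are converted to vectors so that the model can read them. Note that "learned" does not mean a separately pre-trained conversion model: the embedding weights are trained together with the rest of the model.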

We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities.

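Likewise, the usual learned Linear layer followed by a Softmax layer converts the output of the Decoder (the gray block on the right of the figure) into predicted probabilities for the next token of the translation.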

In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30].

In this model, following an earlier paper [30], the two Embedding layers at the bottom of the figure and the Linear layer just before the Softmax layer share the same weight matrix $W$.

Using the Output Embedding to Improve Language Models: published in 2017 by the School of Computer Science, Tel-Aviv University, Israel

As an aside, Tel-Aviv is a city in Israel that is often called Israel's Silicon Valley.

In the embedding layers, we multiply those weights by $\sqrt{d_{model}}$ .

In the Embedding layers, the weights $W$ are multiplied by $\sqrt{d_{model}}$.

I wondered why they are multiplied, but the paper gives no explanation. Presumably it worked well experimentally, or it is sometimes done by convention.
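To make the weight sharing and the scaling concrete, here is a minimal plain-Python sketch. The vocabulary, the matrix values and the toy size $d_{model} = 4$ are all assumptions for illustration, and the exact way the real implementation arranges the shared matrix may differ.

```python
import math

d_model = 4  # toy size; the paper uses 512
vocab = ["<pad>", "yes", "we", "can"]  # hypothetical vocabulary

# One shared weight matrix: row t is the embedding vector of token t.
shared_W = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.2, 0.0, -0.1],
    [0.0, -0.3, 0.2, 0.1],
    [0.2, 0.0, 0.1, 0.3],
]

def embed(token_id):
    """Embedding lookup, multiplied by sqrt(d_model)."""
    scale = math.sqrt(d_model)
    return [w * scale for w in shared_W[token_id]]

def output_logits(decoder_vector):
    """Pre-softmax linear layer that reuses shared_W: one logit per vocabulary entry."""
    return [sum(h * w for h, w in zip(decoder_vector, row)) for row in shared_W]

x = embed(vocab.index("we"))
logits = output_logits([1.0, 0.0, 0.0, 0.0])
```

With $d_{model} = 4$ the scale factor is $2$; with the paper's $d_{model} = 512$ it would be about $22.6$.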

3.5 Positional Encoding

This section is about Positional Encoding. It is the second layer from the bottom in the figure, and there are two of these as well.

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence.

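Because this model uses neither RNNs (recurrence) nor CNNs (convolution), making the model aware of the order of the words in a sentence requires injecting some information about the position of each token in the sequence, whether relative or absolute.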

To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks.

To that end, a Positional Encoding is added to the input embeddings at the very bottom of the stacked Encoder and Decoder.

The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two can be summed.

This Positional Encoding has the same dimensionality $d_{model}$ as the Embedding layer, so that the Positional Encoding values can be added to the Embedding values.

There are many choices of positional encodings, learned and fixed [9].

There are many ways to do Positional Encoding, both learned and fixed, and the paper cites the following work for these.

Convolutional Sequence to Sequence Learning: published in 2017 by Facebook AI Research

In this work, we use sine and cosine functions of different frequencies:

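Specifically, sine and cosine functions of different frequencies are used: sine for the even dimensions $2i$ and cosine for the odd dimensions $2i+1$, in both the left (Encoder) and right (Decoder) Positional Encodings.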

\begin{aligned} PE_{(pos, 2i)} &= \sin(pos/10000^{2i/d_{model}}) \\ PE_{(pos, 2i+1)} &= \cos(pos/10000^{2i/d_{model}}) \end{aligned}

where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid.

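Here $pos$ is the position of the element in the sequence, that is, which token of the sentence it is, and $i$ indexes the dimension: dimensions $2i$ and $2i+1$ of the vector form one sine/cosine pair. In other words, each dimension of the Positional Encoding corresponds to a sinusoid.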

The wavelengths form a geometric progression from 2π to 10000 · 2π.

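The wavelengths form a geometric progression from $2 \pi$ to $10000 \times 2 \pi$: the pair with index $i$ has wavelength $2 \pi \times 10000^{2i/d_{model}}$, so each step in $i$ multiplies the wavelength by a constant factor.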

As an example, consider the Positional Encoding for the Encoder on the left. Suppose the input sentence is "Yes, we can.", Obama's famous line from a while back:

1	2	3
Yes	we	can
そう	我々なら	出来る

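In this case $pos$ is the word position, so $1 \leqq pos \leqq 3$. The dimension index runs over the components of the vector that represents each word's features, and the paper uses $512$ of them. Since dimensions $2i$ and $2i+1$ share one frequency, $i$ runs over $0 \leqq i \leqq 255$ in steps of $1$.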

So when the input is three words, the values computed as in this graph are what gets added as the positional encoding.

When the input sentence grows to 10 words, 144 words, or even 10000 words, the waveforms look like this.

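Indeed, with this function, even when the sentence becomes quite long, say 10000 words, the vector added at each word position should be different. Concretely, the frequency is set by the dimension index $i$ (low dimensions oscillate quickly, high dimensions slowly), and each position samples all of these sinusoids at a different point.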
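The formula and the wavelength range can be sketched in plain Python as follows. The function names are made up for this example, and the sketch assumes $d_{model}$ is even.

```python
import math

def positional_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(angle)      # even dimension 2i
        pe[2 * i + 1] = math.cos(angle)  # odd dimension 2i+1
    return pe

def wavelength(i, d_model):
    """Wavelength, in positions, of the sinusoid used by dimensions 2i and 2i+1."""
    return 2 * math.pi * 10000 ** (2 * i / d_model)

d_model = 512
pe0 = positional_encoding(0, d_model)
pe3 = positional_encoding(3, d_model)
shortest = wavelength(0, d_model)                # equals 2 * pi
longest = wavelength(d_model // 2 - 1, d_model)  # just under 10000 * 2 * pi
```

At position 0 every sine entry is 0 and every cosine entry is 1. The longest wavelength that actually occurs for $d_{model} = 512$ is slightly below $10000 \times 2 \pi$, because the largest exponent is $510/512$ rather than $1$.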

We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$ , $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$ .

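This function was chosen on the hypothesis that it would let the model easily learn to attend by relative position. The reason is that for any fixed offset $k$, that is, any fixed difference between one position and another, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$.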
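The paper does not write this linear relation out, but it can be worked through with the angle-addition formulas. Using the shorthand $\omega_i = 1/10000^{2i/d_{model}}$ (a symbol introduced here for convenience, not one from the paper), each sine/cosine pair satisfies:

\begin{aligned} PE_{(pos+k, 2i)} &= \sin(\omega_i (pos+k)) = \cos(\omega_i k) \, PE_{(pos, 2i)} + \sin(\omega_i k) \, PE_{(pos, 2i+1)} \\ PE_{(pos+k, 2i+1)} &= \cos(\omega_i (pos+k)) = -\sin(\omega_i k) \, PE_{(pos, 2i)} + \cos(\omega_i k) \, PE_{(pos, 2i+1)} \end{aligned}

The coefficients depend only on $k$ and $i$, not on $pos$, so moving by a fixed offset $k$ is the same linear map (a rotation of each pair) at every position.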

We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)).

In fact, when the learned positional embeddings of [9] were tried instead, the two versions produced nearly identical results, as reported in Table 3, row (E), of the paper.

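In that table, looking at the BLEU score column, second from the right, for the base row and the (E) row at the left edge, the scores are 25.8 and 25.7 respectively, so the two were almost the same.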

We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

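So why choose the sinusoidal version when the effect is almost the same? Because it might keep working even when the model meets, at inference time, sentences longer than those used during training.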

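Indeed, the number of dimensions is just the length of the vector representing each word's features and does not limit the sentence length. The sinusoidal function is defined for any $pos$, not only for positions seen during training, so it plausibly stays usable as sentences get longer.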

Wrap-up

That is all for day 5 of the series explaining the "Transformer" that is taking the AI world by storm. This completes Model Architecture. Next time is Why Self-Attention.

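Comments, requests and corrections are welcome, either as comments on this article or as replies or DMs on Twitter!

These articles take quite a lot of time to write, so tips and support are much appreciated!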

Related articles in the series are here

[Added May 2023]
I also develop and run "Q", a ChatGPT service for Slack.
Please give it a try as well.
https://q-bot.suchica.com/
