
A Slow Walkthrough of the "Transformer" That Is Taking the AI World by Storm (Day 2): Introduction / Background

Published on 2021/05/04

This is day 2 of the series explaining the "Transformer", which is taking the AI world by storm.

The PDF of the paper, Attention Is All You Need, is here.

Day 1: Abstract
Day 2: Introduction / Background
Day 3: Model Architecture 1
Day 4: Model Architecture 2
Day 5: Model Architecture 3
Day 6: Why Self-Attention
Day 7: Training
Day 8: Results / Conclusion
Day 9: Source Code

Let's dive right in.

Introduction

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5].

RNNs, and in particular long short-term memory (LSTM) and gated RNNs, have so far been the state-of-the-art models for sequence problems such as language modeling and machine translation.

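A few terms explained:

RNN: Recurrent neural network. A model often used for sequence data. Sequence data here means natural-language text, or cases such as sales and stock prices where the future is inferred from the past. I can go into detail separately if there is demand.
LSTM: Long Short-Term Memory. A kind of RNN. Unlike a plain RNN, it can retain information and relate elements over long spans.
gated RNN: Gated Recurrent Neural Network. An RNN built from Gated Recurrent Units (GRU). A GRU resembles an LSTM: it is designed to prevent vanishing and exploding gradients during training, which lets it retain information and relate elements over long spans.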

This article does not go into the details of LSTM or how it differs from a plain RNN, but I found this article easy to follow.

Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].

Since then, many efforts have continued to push the limits of recurrent language models and encoder-decoder architectures.

I was reading "boundary" as a border, but "push the boundaries" is probably an idiom meaning "go beyond the limits".

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$.

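Models such as RNNs typically organize their computation along the positions of the input and output sequences. Aligning each position with one computation step, they produce a sequence of hidden states $h_t$, where each $h_t$ is computed from the hidden state $h_{t-1}$ at the previous time step $t-1$ and the input at the current time step $t$.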

The text alone may give you a rough idea, but let's look at a picture.
I borrowed the figure from this article.

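Look in particular at time step $t$, enclosed by the dotted line on the right.
Here:

$t$ is the time step, which is what "position" means here
$x_t$ is the input at time $t$ = one English word
$y_t$ is the output at time $t$ = one German word
$h_t$ is the hidden-layer state at time $t$ = one element of the hidden-state sequence
$W_{hh}$ is the parameter for propagating from $h_{t-1}$ to $h_t$. The W stands for Weight
$W_{xh}$ is the parameter for propagating from $x_t$ to $h_t$. The W stands for Weight
$W_{hy}$ is the parameter for propagating from $h_t$ to $y_t$. The W stands for Weight

This is the standard layout of an RNN.
It is what the passage above refers to.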
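To make the recurrence concrete, here is a minimal sketch in plain Python using the symbols listed above. The tanh activation, the absence of bias terms, and the toy weight values are my own assumptions for illustration; the paper does not prescribe them, and `matvec`, `rnn_step` and `rnn_forward` are hypothetical helper names.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One time step: h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t.

    tanh and the missing biases are assumptions made for this sketch.
    """
    pre = [a + b for a, b in zip(matvec(W_hh, h_prev), matvec(W_xh, x_t))]
    h_t = [math.tanh(p) for p in pre]
    y_t = matvec(W_hy, h_t)
    return h_t, y_t

def rnn_forward(xs, h0, W_xh, W_hh, W_hy):
    """Process the sequence in order; step t needs h_{t-1} from step t-1."""
    h, hs, ys = h0, [], []
    for x_t in xs:
        h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy)
        hs.append(h)
        ys.append(y)
    return hs, ys

# Toy example: 2-dimensional inputs, 2 hidden units, 1 output (arbitrary values).
W_xh = [[0.5, 0.0], [0.0, 0.5]]
W_hh = [[0.1, 0.0], [0.0, 0.1]]
W_hy = [[1.0, -1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hs, ys = rnn_forward(xs, [0.0, 0.0], W_xh, W_hh, W_hy)
print(len(hs), len(ys))
```

The thing to notice is the `for` loop in `rnn_forward`: each iteration consumes the `h` produced by the previous one, so the steps within a single sequence have to run one after another.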

This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.

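RNN-like models suit sequence data, but their inherently sequential structure rules out parallelization within a single training example. This becomes critical as the sequences get longer, because memory constraints limit how many examples can be batched together.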

A literal translation was hard to follow, so this is a free translation.

Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.

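Recent work, such as the studies cited in the original, has improved computational efficiency considerably, but the fundamental constraint of sequential computation remains.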

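This is a free translation, but the point is that techniques called factorization tricks and conditional computation have brought improvements, and yet the time has come to rethink the premise itself.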

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19].

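This is why "Attention" has become an integral part of compelling sequence modeling and transduction models across a range of tasks. It lets a model capture dependencies regardless of how far apart the elements are in the input or output sequence.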

In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.

In all but a few cases, however, attention is used together with an RNN.

In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output.

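This paper proposes a model called the "Transformer". It drops recurrence and relies entirely on "Attention" to capture global dependencies between input and output, regardless of how far apart the positions are.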

The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

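The architecture also allows significantly more parallelization during training, and the model reached a new state of the art in translation quality after as little as twelve hours of training on eight P100 GPUs.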

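This GPU is presumably NVIDIA's, and when I looked it up, a single card apparently costs around 800,000 yen. Eight of them would come to 6.4 million yen??
It is now discontinued everywhere, so presumably newer models have come out and it has been retired. Even so, that is expensive. Not something I can try myself.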

Site used as the price reference

ELSA
Ultra-high-speed compute board for HPC: NVIDIA Tesla P100 PCIe 16GB ETSP100-16GER

Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions.

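The goal of reducing sequential computation is also the foundation of the Extended Neural GPU, ByteNet, and ConvS2S. All of them use CNNs as the basic building block and compute the hidden representations in parallel for all input and output positions.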

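Extended Neural GPU: a model proposed in a paper published in 2016, also from Google Brain
ByteNet: a translation model proposed in a paper published in 2016 by DeepMind, known for AlphaGo. Google acquired DeepMind in 2014, so you could call it Google-internal (though I don't know how closely the teams are connected)
ConvS2S: a model proposed in a paper published in 2017 by Facebook AI Research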

I also referred to the following article.

In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet.

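In these models, the number of operations needed to relate two arbitrary positions in the input or output grows with the distance between them: linearly for ConvS2S and logarithmically for ByteNet.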

This makes it more difficult to learn dependencies between distant positions [12].

As a result, it becomes harder to learn dependencies between distant positions.
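To get a feel for "linear" versus "logarithmic", here is a rough sketch that counts how many layers are needed before two positions a given distance apart can influence each other. It is a simplified receptive-field count under my own assumptions (a fixed kernel size, and dilation multiplied by the kernel size at every layer), not the actual ConvS2S or ByteNet architectures, and the function names are hypothetical.

```python
import math

def conv_layers_needed(distance, kernel_size):
    """Stacked ordinary convolutions: each layer widens the receptive
    field by (kernel_size - 1), so the layer count grows linearly."""
    return math.ceil(distance / (kernel_size - 1))

def dilated_layers_needed(distance, kernel_size):
    """Dilated convolutions (assumed dilation kernel_size**layer): the
    receptive field is kernel_size**layers, so growth is logarithmic."""
    layers = 0
    reach = 1  # number of positions covered so far
    while reach < distance + 1:
        reach *= kernel_size
        layers += 1
    return layers

def self_attention_layers_needed(distance):
    """Self-attention relates any two positions within a single layer."""
    return 1

for d in (8, 64, 512):
    print(d,
          conv_layers_needed(d, 3),
          dilated_layers_needed(d, 3),
          self_attention_layers_needed(d))
```

The exact constants depend on details such as whether the kernel is centered or one-sided, so treat the numbers as an illustration of the growth rates only.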

In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.

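In the "Transformer", this is reduced to a constant number of operations.
Averaging over the attention-weighted positions does cost the model some effective resolution, but this is counteracted by a mechanism called "Multi-Head Attention", described later.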

Multi-Head Attention comes up later, so we will look at it there.
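In the meantime, the "averaging" problem can be illustrated with a toy example. This is only a sketch under my own simplifications: the attention weights are hard-coded, and the two "heads" share the same values, whereas real Multi-Head Attention works on learned projections. `weighted_average` is a hypothetical helper.

```python
def weighted_average(weights, values):
    """Attention output for one query: a weighted average of value vectors."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Two positions carrying clearly different information.
values = [[1.0, 0.0], [0.0, 1.0]]

# A single head attending to both positions blends them into one vector.
single = weighted_average([0.5, 0.5], values)

# Two heads can each focus on one position; their outputs are concatenated.
head_a = weighted_average([1.0, 0.0], values)
head_b = weighted_average([0.0, 1.0], values)
multi = head_a + head_b  # list concatenation

print(single)
print(multi)
```

With one head, the two positions are mixed into a single averaged vector; with two heads, each position's content survives side by side in the concatenated result.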

Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.

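The model also uses a mechanism called "Self-Attention".
Sometimes called "Intra-Attention", it relates different positions within a single sequence in order to compute a representation of that sequence.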
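As a preview, here is a minimal sketch of self-attention in plain Python. It assumes scaled dot-product scoring (which the paper introduces in section 3.2, beyond the part covered here) and, to keep it short, uses the input vectors directly as queries, keys and values with no learned projections. `softmax` and `self_attention` are hypothetical helper names.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Every position attends to every position of the same sequence.

    Assumption for this sketch: queries, keys and values are all xs itself.
    """
    d = len(xs[0])
    outputs, weights = [], []
    for q in xs:
        # Score this position against all positions, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        w = softmax(scores)
        # New representation: attention-weighted average of all positions.
        out = [sum(w_j * v[i] for w_j, v in zip(w, xs)) for i in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(sequence)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` covers the whole sequence, so the first position can draw on the last one directly, however far apart they are.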

Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].

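Self-Attention has been used successfully in a variety of tasks:
reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations.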

Specifically, this seems to refer to the following papers. All of them are well known, so I would like to cover them if there is demand.

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].

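According to a paper published in 2015 by New York University and Facebook AI Research, End-to-end memory networks are built on a recurrent attention mechanism rather than sequence-aligned recurrence, and they perform well on simple-language question answering and language modeling tasks. Again, this is not a pure RNN.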

So there was research with a similar focus even before this paper.

End-To-End Memory Networks: published in 2015 by New York University and Facebook AI Research

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

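That said, to the best of the authors' knowledge, the Transformer is the first transduction model that computes representations of its input and output relying entirely on Self-Attention, without using sequence-aligned RNNs or convolution.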

In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].

The following sections describe the Transformer, motivate Self-Attention, and discuss its advantages over other models.

Wrap-up

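That is all for day 2 of the series explaining the "Transformer", which is taking the AI world by storm. Next time we finally get into the internals. This much may already be enough for some readers, but we will keep digging deeper.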

Impressions, requests, and corrections are welcome as comments on this article, or as replies or DMs on Twitter!

Next time, we move on to Model Architecture.

Related articles in this series are here.

I develop and run a service called "Q", a ChatGPT for Slack.
Please give it a try as well.
https://q-bot.suchica.com/
