
A Slow Walkthrough of the "Transformer" Sweeping the AI World (Day 8): Results / Conclusion

Published on 2021/05/13

This is day 8 of a series explaining the Transformer, the model sweeping the AI world.

The PDF of the Attention Is All You Need paper is here.

Day 1: Abstract
Day 2: Introduction / Background
Day 3: Model Architecture 1
Day 4: Model Architecture 2
Day 5: Model Architecture 3
Day 6: Why Self-Attention
Day 7: Training
Day 8: Results / Conclusion
Day 9: Source Code

Links to the earlier articles in the series are at the very bottom.
Let's get right into it.

6 Results

I touched on this in the Abstract installment, but here are the results in a bit more detail.

6.1 Machine Translation

On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4.

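On WMT 2014 English-to-German translation, as shown in Table 2, Transformer (big) beat the best previously reported models, including ensembles, by more than 2.0 BLEU, setting a new state-of-the-art BLEU score of 28.4.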

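This is the bottom row of the table. The paper defines two Transformers: the base model, and the big model, which is a larger network (wider layers, more attention heads, more training steps) rather than one trained on more data.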

The configuration of this model is listed in the bottom line of Table 3.

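The parameters of the base and big models are listed in Table 3. The big model is not just trained for more steps: the dimensions $d_{model}$ and $d_{ff}$, which set the representational capacity the model has for capturing features of language, are doubled; the number of heads $h$ in Multi-Head Attention is doubled; and the dropout rate $P_{drop}$, which guards against overfitting, is tripled to 0.3, whereas 0.1 is used in the base model and is a common choice in general.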
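To make the base vs. big comparison concrete, here is a small sketch that holds the Table 3 values side by side. The class name `TransformerConfig` and its field names are my own, hypothetical choices, not identifiers from the paper or from tensor2tensor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    """Hyperparameters as listed in Table 3 (class and field names are hypothetical)."""
    n_layers: int     # N: number of stacked layers in the encoder and in the decoder
    d_model: int      # model (embedding) dimension
    d_ff: int         # inner dimension of the feed-forward network
    n_heads: int      # h: number of attention heads
    p_drop: float     # dropout rate
    train_steps: int  # number of training steps

    @property
    def d_k(self) -> int:
        # Per-head key dimension, d_k = d_model / h (d_v is the same in the paper)
        return self.d_model // self.n_heads


BASE = TransformerConfig(6, 512, 2048, 8, 0.1, 100_000)
BIG = TransformerConfig(6, 1024, 4096, 16, 0.3, 300_000)

for name in ("d_model", "d_ff", "n_heads", "p_drop", "train_steps"):
    base_value, big_value = getattr(BASE, name), getattr(BIG, name)
    print(f"{name}: base={base_value} big={big_value} ratio={big_value / base_value:.1f}")
print("d_k per head:", BASE.d_k, BIG.d_k)
```

Note that because both $d_{model}$ and $h$ double, the per-head dimension stays at 64 in both models.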

Blank cells in the table mean the value is the same as in the base model.

Training took 3.5 days on 8 P100 GPUs.

Training the big model took 3.5 days on eight NVIDIA P100 GPUs. This has already come up a couple of times in the series, so I will skip the details.

Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.

In other words, even the base model beats every previously published model and ensemble, and does so at a fraction of the training cost of the competing models.

This result is what made the Transformer famous.

On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model.

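Likewise, on the English-to-French translation task, the big model reached a BLEU score of 41.0. It outperformed all previously published single models at less than 1/4 of the training cost of the previous state-of-the-art model.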

It worked whether the target language was German or French.

The Transformer (big) model trained for English-to-French used dropout rate $P_{drop} = 0.1$, instead of 0.3.

A small detail: the big model for English-to-French used a dropout rate of 0.1 rather than 0.3.

I thought they should have kept this consistent... presumably they had a reason to change it. It does make an apples-to-apples comparison harder.

For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals.

For the base model, the final model is the average of the last 5 checkpoints, which were saved at 10-minute intervals.

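A checkpoint is a snapshot of the partially optimized parameters, such as the weights, saved at intervals during training. Training a neural network takes a long time, as it does for the models in this paper, so saving checkpoints reduces the risk of losing all the computation so far if, for example, the server runs into a resource failure partway through.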

I referred to this page for the explanation above.

For the big models, we averaged the last 20 checkpoints.

For the big model, the last 20 checkpoints were averaged.
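Checkpoint averaging can be sketched in a few lines of NumPy. This is only an illustration under my own assumptions: each checkpoint is represented as a plain dict of arrays, and `average_checkpoints` is a hypothetical helper, not the function the authors used in tensor2tensor.

```python
import numpy as np


def average_checkpoints(checkpoints):
    """Element-wise mean of parameter dicts that share the same keys and shapes."""
    keys = checkpoints[0].keys()
    return {k: np.mean([ckpt[k] for ckpt in checkpoints], axis=0) for k in keys}


# Toy stand-ins for saved checkpoints: 8 snapshots of a tiny two-parameter model.
rng = np.random.default_rng(0)
saved = [{"w": rng.normal(size=(2, 3)), "b": rng.normal(size=3)} for _ in range(8)]

last_n = 5  # the base model averages the last 5 checkpoints; the big model uses 20
averaged = average_checkpoints(saved[-last_n:])
print(averaged["w"].shape, averaged["b"].shape)
```

The averaged parameters are then used as a single model at inference time, so the cost of decoding does not grow the way it would with an ensemble.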

We used beam search with a beam size of 4 and length penalty α = 0.6 [38].

Following Google's 2016 paper "Google's neural machine translation system: Bridging the gap between human and machine translation", they used beam search with a beam size of 4 and a length penalty α of 0.6.

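Beam search is a search method used when predicting sequential data such as text. At each step it scores the possible next tokens, extends the current candidate sequences with them, and carries forward only the highest-scoring candidates.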

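The parameters of beam search include the beam size, length normalization, and the coverage penalty. The beam size is the number of candidate partial sequences kept at each step. With a beam size of 4, as here, only the 4 highest-scoring partial sequences survive each step. To take a toy case, if the vocabulary were just the 26 letters of the alphabet, each step would produce 4 × 26 = 104 extended candidates, of which 4 are kept and the rest are discarded.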

This article helped me with the details. To understand it properly, it is better to read the original paper.
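Here is a minimal sketch of beam search with the length penalty from the GNMT paper, $lp(Y) = ((5 + |Y|) / 6)^{\alpha}$. Everything about the model is an assumption for illustration: `toy_step` is a hypothetical stand-in for a real decoder, and the vocabulary has only three tokens. Real implementations are more involved (batching, coverage penalty, pruning rules).

```python
import math

EOS = "</s>"


def length_penalty(length, alpha=0.6):
    """GNMT length penalty: ((5 + |Y|) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def beam_search(step_fn, beam_size=4, max_len=10, alpha=0.6):
    """step_fn(prefix) returns {token: log-probability of that token coming next}."""
    beams = [((), 0.0)]  # (tokens so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:  # keep only the top beam_size
            (finished if seq[-1] == EOS else beams).append((seq, score))
        if not beams:  # every surviving hypothesis has ended: stop early
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    # Rank by log-probability divided by the length penalty
    return max(finished, key=lambda c: c[1] / length_penalty(len(c[0]), alpha))


def toy_step(prefix):
    # Hypothetical model: prefers "a" first, then becomes likely to stop.
    if not prefix:
        probs = {"a": 0.6, "b": 0.3, EOS: 0.1}
    else:
        probs = {"a": 0.2, "b": 0.2, EOS: 0.6}
    return {tok: math.log(p) for tok, p in probs.items()}


best_seq, best_logp = beam_search(toy_step, beam_size=4, alpha=0.6)
print(best_seq)  # ('a', '</s>')
```

The `max_len` argument plays the role of the maximum output length, and the `break` is one simple way to terminate early.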

These hyperparameters were chosen after experimentation on the development set.

These hyperparameters were chosen after experiments on the development set.

We set the maximum output length during inference to input length + 50, but terminate early when possible [38].

During inference, the maximum output length was set to the input length + 50, and decoding stops earlier than that whenever possible.

A translation that comes out far too long would be no good, after all.

Table 2 summarizes our results and compares our translation quality and training costs to other model architectures from the literature.

Table 2 summarizes the translation quality and training cost of this model and of the models from the literature.

We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU (footnote 5).

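The training cost of each model is an estimate. It is expressed as the total number of floating point operations (FLOPs) used for training. This is a count of operations, not FLOPS (floating point operations per second), which is the unit for how many floating point operations a processor can perform per second.

The FLOPs for each model were estimated by multiplying three values: the training time, the number of GPUs used, and an estimate of the sustained single-precision floating point throughput of each GPU.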

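A comparison is meaningless unless it is apples-to-apples, but the hardware naturally differs from paper to paper, so the authors level the field by estimating from the three factors above. The number of GPUs depends on the budget and resources of each research group, and per-GPU throughput differs by GPU model even among NVIDIA GPUs.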

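The paper does not show the calculation for each model. Since this estimate is central to the conclusion, I would have liked to see how each figure was actually computed and why the estimate is valid.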
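For the Transformer rows, at least, the product described above can be reproduced. The sketch below assumes the 9.5 TFLOPS sustained figure that footnote 5 of the paper gives for the P100, and the 12-hour training time for the base model from Section 5.2; the function name `training_flops` is my own.

```python
SECONDS_PER_DAY = 24 * 60 * 60
P100_SUSTAINED_FLOPS = 9.5e12  # assumption taken from footnote 5: 9.5 TFLOPS per P100


def training_flops(days, n_gpus, flops_per_gpu):
    """Total floating point operations = training time x GPU count x per-GPU throughput."""
    return days * SECONDS_PER_DAY * n_gpus * flops_per_gpu


base = training_flops(0.5, 8, P100_SUSTAINED_FLOPS)  # 12 hours on 8 P100s
big = training_flops(3.5, 8, P100_SUSTAINED_FLOPS)   # 3.5 days on 8 P100s
print(f"base: {base:.1e}  big: {big:.1e}")  # base: 3.3e+18  big: 2.3e+19
```

These match the training costs reported for the base and big models in Table 2. The figures for the other models depend on training times and GPU types taken from their respective papers, which is the part that is hard to verify.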

6.2 Model Variations

To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013.

To evaluate how important each component of the Transformer is, they varied the base model in different ways and measured the change in performance on English-to-German translation on the development set, newstest2013.

We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3.

For this, they used beam search as described above but did not use checkpoint averaging. The results are in Table 3.

In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2.

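In rows (A) of the table, with the amount of computation held constant as described in Section 3.2.2, they varied the number of heads $h$ in Multi-Head Attention together with the key and value dimensions $d_k$ and $d_v$, trying four settings.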

While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.

$h = 1$, that is, single-head attention, scored 0.9 BLEU lower than the best setting, $h = 16$. Quality also dropped with too many heads, as with $h = 32$.

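The paper does not say whether $h$ has to be a multiple of 4. Since $d_k = d_{model} / h$ has to be an integer, $h$ must at least divide 512. Beyond that, I suspect trying finer-grained values was simply too much work (possibly far too much).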
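The constraint behind rows (A) is easy to check with a few lines of arithmetic. This sketch assumes the rule from Section 3.2.2 that $d_k = d_v = d_{model} / h$; the helper name `head_dim` is my own.

```python
D_MODEL = 512  # width of the base model


def head_dim(d_model, n_heads):
    """Per-head dimension d_k = d_v = d_model / h; h must divide d_model."""
    if d_model % n_heads != 0:
        raise ValueError(f"h={n_heads} does not divide d_model={d_model}")
    return d_model // n_heads


for h in (1, 4, 8, 16, 32):
    d_k = head_dim(D_MODEL, h)
    # h * d_k stays at 512, which is why the total computation is roughly constant
    print(f"h={h:2d}  d_k=d_v={d_k:3d}  h*d_k={h * d_k}")
```

Any divisor of 512 would be a valid $h$ under this rule, so the choice of 1, 4, 16, and 32 looks like a sampling of the range rather than a hard requirement.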

In Table 3 rows (B), we observe that reducing the attention key size $d_k$ hurts model quality.

In rows (B) of the table, they observed that reducing the attention key dimension $d_k$ below that of the base model hurts model quality.

This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial.

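This suggests that determining compatibility, that is, how well a query matches a key, is not easy, and that a compatibility function more sophisticated than the dot product used in Scaled Dot-Product Attention might be beneficial.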

The paper does not spell out the reasoning behind this conclusion, and I could not work it out.
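For reference, the "compatibility function" here is the score computed between each query and each key. Below is a minimal NumPy sketch of Scaled Dot-Product Attention, $\mathrm{softmax}(QK^T / \sqrt{d_k})V$, with the score step marked. The shapes and random inputs are my own illustrative choices.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v). Returns (output, weights)."""
    d_k = q.shape[-1]
    # Compatibility function: dot product of each query with each key, scaled
    scores = q @ k.T / np.sqrt(d_k)
    # Numerically stable softmax over the keys
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
d_k = 16  # one of the reduced key sizes tried in rows (B)
q = rng.normal(size=(3, d_k))
k = rng.normal(size=(5, d_k))
v = rng.normal(size=(5, 8))
out, w = scaled_dot_product_attention(q, k, v)
print(out.shape, w.sum(axis=-1))
```

With a smaller $d_k$, each score is a dot product over fewer dimensions, so the query and key have less room to encode what should match what. That is one plausible reading of why quality drops in rows (B), though the paper does not state it in these terms.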

We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting.

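Next, in rows (C) and (D), they tried parameters both smaller and larger than those of the base model. As expected, larger parameter values, that is, bigger models, gave better performance. They also confirmed that dropout is very helpful for avoiding overfitting.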

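That said, for the number of layers $N$ in the encoder and decoder, 6 seems to be about right: going up to 8 does not improve the BLEU score. $N$ is a variable that appeared much earlier in the series; in the architecture figure it is the number of times the layer is stacked.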

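For the model dimension $d_{model}$, 1024 does score 0.2 BLEU higher than the base model's 512. A larger dimension lets the model represent more features, which is what people mean by representational capacity, so a higher score is arguably to be expected. The big model does in fact use 1024.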

For the FFN dimension $d_{ff}$, doubling it from the base model's 2048 to 4096 beats the base model, and 4096 is what the big model uses.

Setting the dropout rate to 0 does hurt the score, but raising it to 0.2 also hurts. Even so, the big model uses 0.3.

In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical results to the base model.

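In row (E), they replaced the paper's sinusoidal positional encoding with learned positional embeddings from another paper and compared the two. The results were nearly identical to the base model. As described in an earlier section, though, the sinusoidal version may work better on sequences longer than those seen in training, so they kept their own method.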
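As a reminder of what is being swapped out in row (E), here is a minimal NumPy sketch of the sinusoidal positional encoding, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. The function name is my own, and the sketch assumes an even $d_{model}$.

```python
import numpy as np


def sinusoidal_positional_encoding(max_len, d_model):
    """Returns a (max_len, d_model) array; assumes d_model is even."""
    pos = np.arange(max_len)[:, None]            # positions 0 .. max_len-1
    two_i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe


pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)
```

Because this is a fixed function of the position, it can be evaluated for any `max_len`, including positions never seen in training. A learned embedding table, by contrast, only has rows for the positions it was trained with.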

6.3 English Constituency Parsing

To evaluate if the Transformer can generalize to other tasks we performed experiments on English constituency parsing.

To evaluate whether the Transformer generalizes to tasks other than translation, they also ran experiments on English constituency parsing.

This is parsing, so it is a different problem from translation.

This task presents specific challenges: the output is subject to strong structural constraints and is significantly longer than the input.

This task poses its own challenges (different from those in translation): the output is subject to strong structural constraints, and it is significantly longer than the input.

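Since the output is a parse tree, it makes sense that it has to obey strict structural rules, and that it is shaped heavily by the structure of the input sentence.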
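To get a feel for both challenges, here is a small sketch that flattens a parse tree into a bracketed token sequence, in the style commonly used for sequence-to-sequence parsing. The exact output format used in the paper is not described in this article, so the format, the example tree, and the helper names are all my own assumptions.

```python
def linearize(tree):
    """Depth-first bracketed linearization of (label, child, ...) tuples.

    Leaves are strings (part-of-speech tags standing in for the words).
    """
    if isinstance(tree, str):
        return [tree]
    label, *children = tree
    tokens = ["(" + label]
    for child in children:
        tokens += linearize(child)
    tokens.append(")" + label)
    return tokens


def is_well_formed(tokens):
    """Structural constraint: every opening bracket is closed by the matching label."""
    stack = []
    for tok in tokens:
        if tok.startswith("("):
            stack.append(tok[1:])
        elif tok.startswith(")"):
            if not stack or stack.pop() != tok[1:]:
                return False
    return not stack


sentence = "John has a dog .".split()
tree = ("S", ("NP", "NNP"), ("VP", "VBZ", ("NP", "DT", "NN")), ".")
output = linearize(tree)
print(" ".join(output))
print(len(sentence), "input tokens ->", len(output), "output tokens")
```

In this toy case a 5-token sentence becomes a 13-token output, and dropping even one closing bracket makes the sequence invalid. A decoder for this task has to get all of that right token by token.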

Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes [37].

In addition, according to prior work [37], RNN sequence-to-sequence models have not been able to reach state-of-the-art results when training data is scarce.

We trained a 4-layer transformer with $d_{model} = 1024$ on the Wall Street Journal (WSJ) portion of the Penn Treebank [25], about 40K training sentences.

They trained a 4-layer Transformer with $d_{model} = 1024$ on the Wall Street Journal (WSJ) portion of the Penn Treebank (a large annotated English corpus), which has about 40,000 training sentences.

We also trained it in a semi-supervised setting, using the larger high-confidence and BerkleyParser corpora from with approximately 17M sentences [37].

They also trained it in a semi-supervised setting, using the larger high-confidence and BerkleyParser corpora with about 17 million sentences.

A corpus is a large collection of language data, used here as a training dataset.

We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens for the semi-supervised setting.

The WSJ-only setting used a vocabulary of 16,000 tokens, and the semi-supervised setting used a vocabulary of 32,000 tokens.

We performed only a small number of experiments to select the dropout, both attention and residual (section 5.4), learning rates and beam size on the Section 22 development set, all other parameters remained unchanged from the English-to-German base translation model.

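They ran only a small number of experiments, on the Section 22 development set, to choose the dropout (both attention and residual), the learning rate, and the beam size. All other parameters were left unchanged from the English-to-German base translation model.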

During inference, we increased the maximum output length to input length + 300.

During inference, the maximum output length was raised to the input length + 300.

We used a beam size of 21 and α = 0.3 for both WSJ only and the semi-supervised setting.

The beam size was 21 and α was 0.3, for both the WSJ-only and the semi-supervised settings.

Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8].

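The results are in Table 4.

Despite the lack of tuning specific to constituency parsing (they did do a little), the model performed surprisingly well. It beat every previously reported model except the Recurrent Neural Network Grammar.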

The table is not sorted by score, so you have to look closely, but the Transformer does lose slightly to the RNN-based model in the fourth row.

In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the BerkeleyParser [29] even when training only on the WSJ training set of 40K sentences.

In contrast to RNN sequence-to-sequence models, the Transformer outperformed the BerkeleyParser even when trained only on the 40,000 sentences of the WSJ training set.

7 Conclusion

Summary.

In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.

This paper presented the Transformer, the first sequence transduction model built entirely on attention. It replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.

For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers.

On translation tasks, it can be trained significantly faster than architectures based on recurrent or convolutional layers.

On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art.

It set a new state of the art on both English-to-German and English-to-French translation.

In the former task our best model outperforms even all previously reported ensembles.

On English-to-German, the best model outperformed even all previously reported ensembles.

We are excited about the future of attention-based models and plan to apply them to other tasks.

They are excited about the future of attention-based models and plan to apply them to tasks other than translation.

We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goals of ours.

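They plan to extend the Transformer to inputs and outputs other than text, and to investigate local, restricted attention mechanisms for handling large inputs and outputs such as images, audio, and video efficiently. Making generation less sequential is another of their research goals.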

And indeed, from 2017 onward, it has been applied to other fields such as images.

The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.

The code used for training and evaluation is available at the link above.

Closing

That's it for day 8 of the series explaining the Transformer. This completes the paper portion. I plan to write the source code part if there is demand!

Feedback, requests, and corrections are welcome in the comments on this article, or by reply or DM on Twitter!

I also spend quite a lot of time writing these, so tips and support are much appreciated!

Related articles in the series are here.
