Simpler than the Transformer? The Birth of "MLP-Mixer" (Day 5): Related Work / Conclusion




  • Day 1: Abstract / Introduction
  • Day 2: Mixer Architecture
  • Day 3: Experiments 1
  • Day 4: Experiments 2
  • Day 5: Related Work / Conclusion
  • Day 6: Source Code

The original paper, "MLP-Mixer: An all-MLP Architecture for Vision," can be found here. It was published on May 4, 2021 by a joint team from Google Research and Google Brain, and generated quite a buzz on Twitter among the people involved.


4 Related Work

Mixer is a new architecture for computer vision that differs from previous successful architectures because it uses neither convolutional nor self-attention layers.


Nevertheless, the design choices can be traced back to ideas from the literature on CNNs [24, 25] and Transformers [48].


There is also a past article in this series on "Attention is all you need"; please refer to it for background.

CNNs have been the de facto standard in computer vision since the AlexNet model [24] surpassed prevailing approaches based on hand-crafted image features; see [34] for an overview.


34: A. Pinz. Object categorization

An enormous amount of work followed, focusing on improving the design of CNNs.
We highlight only the directions most relevant for this work.


Simonyan and Zisserman [40] demonstrated that a series of convolutions with a small 3×3 receptive field is sufficient to train state-of-the-art models.

In other words, Simonyan and Zisserman [40] showed that a stack of convolutions with a small 3×3 receptive field (that is, the size of the filter, or kernel) is sufficient to train a state-of-the-art (SOTA) model.

40: K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition
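Why a stack of 3×3 convolutions is enough: each additional stride-1 layer grows the receptive field by k−1 pixels, so two 3×3 layers already cover the same 5×5 region as a single 5×5 convolution, with fewer parameters and an extra non-linearity in between. A small sketch of this arithmetic (not from the paper, just the standard receptive-field recurrence):

```python
def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Receptive field (one side, in pixels) of a stack of stride-1 convolutions.

    Each additional k x k stride-1 layer grows the receptive field by (k - 1).
    """
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1
    return rf

# Two stacked 3x3 convs see the same 5x5 region as one 5x5 conv;
# three stacked 3x3 convs match a single 7x7 conv.
print(receptive_field(2))  # -> 5
print(receptive_field(3))  # -> 7
```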

Later, He et al. [15] introduced skip-connections together with the batch normalization layer [20], which enabled training of very deep neural networks with hundreds of layers and further improved performance.
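The combination of skip-connections and batch normalization can be sketched in a few lines of NumPy. This is a simplified, hypothetical residual block (inference-style batch norm with no learned scale/shift, and plain dense layers instead of convolutions), meant only to show that the skip path preserves the input and its shape:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature over the batch dimension using the batch
    # statistics directly (no learned gamma/beta in this sketch).
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, w1, w2):
    # y = x + W2 @ relu(BN(W1 @ BN(x)))  -- a simplified residual block
    h = np.maximum(batch_norm(x) @ w1, 0.0)
    return x + batch_norm(h) @ w2

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))            # batch of 8 examples, 16 features
w1 = rng.normal(size=(16, 16)) * 0.1
w2 = rng.normal(size=(16, 16)) * 0.1
y = residual_block(x, w1, w2)
assert y.shape == x.shape               # the skip path preserves shape
```

Because the output is x plus a residual term, gradients can flow unchanged through the identity path, which is what makes networks with hundreds of layers trainable.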


A prominent line of research has investigated the benefits of using sparse convolutions, such as grouped [54] or depth-wise [9, 17] variants.

Grouped [54] and depth-wise [9, 17] convolutions are prominent examples of such "sparse" convolutions, where many entries of the weight matrix are fixed to zero so the operation becomes lighter.
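The parameter savings from grouped and depth-wise convolutions are easy to count. A small back-of-the-envelope sketch (weight counts only, biases ignored; the channel width 256 is an arbitrary example):

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution with `groups` groups (no bias)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k

c = 256
dense = conv_params(c, c, 3)                  # standard 3x3 conv
grouped = conv_params(c, c, 3, groups=8)      # grouped conv: 8x fewer weights
depthwise = conv_params(c, c, 3, groups=c)    # depth-wise: one filter per channel
separable = depthwise + conv_params(c, c, 1)  # depth-wise separable (+1x1 mixing)

print(dense, grouped, depthwise, separable)
```

Restricting which input channels each filter may see is exactly the sense in which these convolutions are "sparse": the equivalent dense weight matrix has large blocks of zeros.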

Finally, Hu et al. [18] and Wang et al. [51] propose to augment convolutional networks with non-local operations to partially alleviate the constraint of local processing from CNNs.


Mixer takes the idea of using convolutions with small kernels to the extreme:


by reducing the kernel size to 1×1 it effectively turns convolutions into standard dense matrix multiplications applied independently to each spatial location (channel-mixing MLPs).


This modification alone does not allow aggregation of spatial information; to compensate, we apply dense matrix multiplications to every feature across all spatial locations (token-mixing MLPs).


In Mixer, matrix multiplications are applied row-wise or column-wise on the “patches×features” input table, which is also closely related to the work on sparse convolutions.

(The "patches×features" table here is exactly the S × C input table introduced in the architecture section.)
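The row-wise/column-wise view can be made concrete in NumPy. The sizes below (S = 196 patches, C = 768 channels) are plausible placeholders, and the real Mixer block additionally wraps these multiplications in LayerNorm, two-layer MLPs with GELU, and skip-connections; this sketch shows only the core matrix operations:

```python
import numpy as np

rng = np.random.default_rng(0)
S, C = 196, 768               # e.g. 14x14 patches, 768 channels (placeholder sizes)
X = rng.normal(size=(S, C))   # the "patches x features" input table

# Channel-mixing: a dense layer applied independently to each row (patch).
# This is exactly what a 1x1 convolution computes at every spatial location.
W_channel = rng.normal(size=(C, C))
channel_mixed = X @ W_channel           # mixes columns, one row at a time

# Token-mixing: the same idea applied column-wise, i.e. on X transposed.
# Each feature is mixed across all S spatial locations.
W_token = rng.normal(size=(S, S))
token_mixed = (X.T @ W_token).T         # equivalently: W_token.T @ X

assert channel_mixed.shape == (S, C)
assert token_mixed.shape == (S, C)
```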

Finally, Mixer makes use of skip-connections [15] and normalization layers [2, 20].


The initial applications of self-attention based Transformer architectures to computer vision were for generative modeling [8, 32].


Their value for image recognition was demonstrated later, albeit in combination with a convolution-like locality bias [36], or on very low-resolution images [10].


Recently, Dosovitskiy et al. [14] introduced ViT, a pure transformer model that has fewer locality biases, but scales well to large data.


14: A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale

ViT achieves state-of-the-art performance on popular vision benchmarks while retaining the robustness properties of CNNs [6].


6: S. Bhojanapalli, A. Chakrabarti, D. Glasner, D. Li, T. Unterthiner, and A. Veit. Understanding robustness of transformers for image classification. arXiv preprint arXiv:2103.14586, 2021

Touvron et al. [47] showed that ViT can be trained effectively on smaller datasets using extensive regularization.


47: H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020

Mixer borrows design choices from recent transformer-based architectures;


the design of MLP-blocks used in Mixer originates from Vaswani et al. [48].


48: A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need

Further, converting images to a sequence of patches and then directly processing embeddings of these patches originates in Dosovitskiy et al. [14].
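A minimal sketch of that patch-sequence idea in NumPy: split the image into non-overlapping patches, flatten each one, and embed every patch with a shared linear projection. The 512-dimensional embedding width is a hypothetical choice, not a value from either paper:

```python
import numpy as np

def image_to_patches(img, patch):
    """Split an H x W x C image into a sequence of flattened patches."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    return (img.reshape(h // patch, patch, w // patch, patch, c)
               .transpose(0, 2, 1, 3, 4)          # group by (row-block, col-block)
               .reshape(-1, patch * patch * c))   # (num_patches, patch*patch*c)

rng = np.random.default_rng(0)
img = rng.normal(size=(224, 224, 3))
patches = image_to_patches(img, 16)      # 14 * 14 = 196 patches
assert patches.shape == (196, 16 * 16 * 3)

# A shared linear projection embeds each patch into the hidden width.
W_embed = rng.normal(size=(16 * 16 * 3, 512))    # 512 is a hypothetical width
tokens = patches @ W_embed
assert tokens.shape == (196, 512)
```

The resulting (196, 512) token table is exactly the "patches × features" input that both ViT and Mixer operate on.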



Similar to Mixer, many recent works strive to design more effective architectures for vision.


For example, Srinivas et al. [41] replace 3×3 convolutions in ResNets by self-attention layers.


Ramachandran et al. [36], Li et al. [26], and Bello [3] design networks that work well with new attention-like mechanisms.


Mixer can be seen as a step in an orthogonal direction, without reliance on locality bias and attention mechanisms.


The work of Neyshabur [29] is closely related.
The authors devise custom regularization and optimization algorithms to train a fully-connected network for vision.


It attains reasonable performance on small-scale image classification tasks.


Our architecture instead relies on token-mixing and channel-mixing MLPs, uses standard regularization and optimization techniques, and scales to large data effectively.


Traditionally, networks evaluated on ImageNet [13] are trained from scratch using Inception-style pre-processing [45].

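Inception-style pre-processing is essentially a random resized crop: sample a crop with a random area fraction and aspect ratio, resize it to the training resolution, and usually apply a random horizontal flip. A sketch of the box-sampling step using only the standard library (the area range, aspect-ratio range, and retry count below are conventional choices, not values taken from this paper):

```python
import math
import random

def inception_crop_box(height, width, rng,
                       area_range=(0.08, 1.0), ratio_range=(3 / 4, 4 / 3)):
    """Sample a crop box (x, y, w, h) in the style of Inception pre-processing."""
    area = height * width
    for _ in range(10):  # retry a few times, then fall back to a center crop
        target_area = rng.uniform(*area_range) * area
        log_lo, log_hi = math.log(ratio_range[0]), math.log(ratio_range[1])
        ratio = math.exp(rng.uniform(log_lo, log_hi))
        w = int(round(math.sqrt(target_area * ratio)))
        h = int(round(math.sqrt(target_area / ratio)))
        if 0 < w <= width and 0 < h <= height:
            x = rng.randint(0, width - w)
            y = rng.randint(0, height - h)
            return x, y, w, h
    side = min(height, width)
    return (width - side) // 2, (height - side) // 2, side, side

rng = random.Random(0)
x, y, w, h = inception_crop_box(224, 224, rng)
assert 0 <= x and 0 <= y and x + w <= 224 and y + h <= 224
```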

For smaller datasets, transfer of ImageNet models is popular.


However, modern state-of-the-art models typically use either weights pre-trained on larger datasets, or more recent data-augmentation and training strategies.


For example, Dosovitskiy et al. [14], Kolesnikov et al. [22], Mahajan et al. [28], Pham et al. [33], Xie et al. [53] all advance the state of the art in image classification using large-scale pre-trained weights.


Examples of improvements due to augmentation or regularization changes include Cubuk et al. [11], who attain excellent classification performance with learned data augmentation, and Bello et al. [4], who show that canonical ResNets are still near the state of the art, if one uses recent training and augmentation strategies.


5 Conclusions

We describe a very simple architecture for vision.

The paper describes a very simple architecture for computer vision. That architecture is, of course, the MLP-Mixer.

Our experiments demonstrate that it is as good as existing state-of-the-art methods in terms of the trade-off between accuracy and computational resources required for training and inference.


We believe these results open many questions.


On the practical side, it may be useful to study the features learned by the model and identify the main differences (if any) from those learned by CNNs and Transformers.


On the theoretical side, we would like to understand the inductive biases hidden in these various features and eventually their role in generalization.


Most of all, we hope that our results spark further research, beyond the realms of established models based on convolutions and self-attention.


It would be particularly interesting to see whether such a design works in NLP or other domains.