
【ML Paper】DeiT: Can ViT work with fewer images? Part 1

Published on 2024/10/04

This is Part 1 of a summary of this paper.
The authors propose an improved Vision Transformer, DeiT (Data-efficient image Transformer).

Original Paper: https://arxiv.org/abs/2012.12877v2

0. Abstract

The high performance of ViT was obtained by pre-training on hundreds of millions of images using a large infrastructure.
In this work, the authors produce competitive convolution-free transformers by training on ImageNet only. Their model achieves a top-1 accuracy of 83.1% (single crop) when trained on a single computer in less than 3 days.

They also propose a teacher-student strategy specific to transformers. It relies on a distillation token that ensures the student learns from the teacher through attention.
This token-based distillation, especially when using a convnet as the teacher, gives them 85.2% accuracy, competitive with convnets.
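
The distillation procedure itself is detailed later in the paper (I plan to cover it in the next part), but the rough idea is that the student is trained both on the true label and on the label predicted by the teacher. The snippet below is my own minimal sketch of that idea, not the authors' code; the function name and the equal 0.5/0.5 weighting are assumptions.

```python
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    # cls_logits:     student logits from the class token
    # dist_logits:    student logits from the distillation token
    # teacher_logits: logits from the (frozen) convnet teacher
    # labels:         ground-truth ImageNet labels
    teacher_labels = teacher_logits.argmax(dim=-1)            # teacher's hard decision
    loss_cls = F.cross_entropy(cls_logits, labels)            # supervised term
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)  # distillation term
    return 0.5 * loss_cls + 0.5 * loss_dist                   # assumed equal weighting
```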

1. Introduction

・Accuracy and Throughput

・The throughput is measured as the number of images processed per second on a V100 GPU (a small measurement sketch follows this list).
・DeiT-B is identical to ViT-B, but the training is better adapted to a data-starving regime.
・It is learned in a few days on one machine.
・The symbol ⚗ refers to models trained with their transformer-specific distillation.
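
As a side note on the throughput metric: here is a minimal sketch of how images/s could be measured on a GPU. This is my own illustration, not the authors' benchmarking code; the batch size, warm-up count, and input resolution are assumptions.

```python
import time
import torch

@torch.no_grad()
def images_per_second(model, batch_size=64, image_size=224, iters=50, device="cuda"):
    model.eval().to(device)
    x = torch.randn(batch_size, 3, image_size, image_size, device=device)
    for _ in range(10):        # warm-up iterations, excluded from the timing
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return batch_size * iters / (time.time() - start)
```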

The vision transformer (ViT) introduced by Dosovitskiy et al. is an architecture directly inherited from Natural Language Processing.
That paper concluded that transformers "do not generalize well when trained on insufficient amounts of data", and training those models involved extensive computing resources.

They build upon the visual transformer architecture from Dosovitskiy et al. and the improvements included in the timm library. With their Data-efficient image Transformers (DeiT), they report large improvements over previous results.
They introduce a token-based distillation strategy specific to transformers, denoted DeiT⚗, and show that it advantageously replaces the usual distillation.

In summary, their work makes the following contributions:
・They achieve competitive results against the state of the art on ImageNet with no external data.
・They introduce a new distillation procedure based on a distillation token, which plays the same role as the class token, except that it aims at reproducing the label estimated by the teacher. Both tokens interact in the transformer through attention (see the sketch after this list). This transformer-specific strategy outperforms vanilla distillation by a significant margin.
・Models trained with this distillation also transfer well to downstream tasks such as fine-grained classification.
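
To make the distillation-token idea concrete, here is a minimal sketch of how a second learned token can be prepended next to the class token so that both interact with the patch tokens through self-attention. This is my own illustration under assumptions, not the authors' code; the class name, dimensions, and zero initialization are made up.

```python
import torch
import torch.nn as nn

class DistilledViTEmbedding(nn.Module):
    """Sketch: prepend a class token AND a distillation token to the patch tokens."""

    def __init__(self, num_patches=196, dim=768):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, dim))  # the extra token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, dim))

    def forward(self, patch_tokens):                     # (B, num_patches, dim)
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        dist = self.dist_token.expand(b, -1, -1)
        x = torch.cat([cls, dist, patch_tokens], dim=1)  # (B, num_patches + 2, dim)
        # After the transformer blocks, x[:, 0] would feed the classification
        # head and x[:, 1] the distillation head.
        return x + self.pos_embed
```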

This is the end of this part. I'll write the next part soon.
