【ML Paper】DeiT: Can ViT learn from fewer images? Part 5
This is a summary of the paper, Part 5.
The authors proposed an improved Vision Transformer, DeiT (Data-efficient image Transformer).
Original Paper: https://arxiv.org/abs/2012.12877v2
4.2 Comparison of distillation methods
They compared the performance of different distillation strategies.
・Hard distillation significantly outperforms soft distillation, even when using only a class token.
・The proposed distillation strategy improves performance further, showing that the two tokens provide complementary information useful for classification: the joint classifier over the two tokens is significantly better than the independent class and distillation classifiers (a sketch of the two losses follows below).
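To make the strategies being compared concrete, here is a minimal PyTorch-style sketch of the soft and hard distillation losses, plus the joint prediction over the two tokens. The function names and the `tau`/`lam` arguments are my own naming choices (they correspond to the paper's τ and λ); this is an illustration under those assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def soft_distillation_loss(student_logits, teacher_logits, labels, tau=3.0, lam=0.1):
    """Soft distillation: cross-entropy on the true labels plus a
    temperature-scaled KL term against the teacher's soft targets."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau
    return (1.0 - lam) * ce + lam * kl

def hard_distillation_loss(class_logits, distill_logits, teacher_logits, labels):
    """Hard distillation: the teacher's argmax prediction serves as a second
    hard label; the class head learns the true label and the distillation
    head learns the teacher's label, with equal weights."""
    teacher_labels = teacher_logits.argmax(dim=-1)
    return 0.5 * F.cross_entropy(class_logits, labels) + \
           0.5 * F.cross_entropy(distill_logits, teacher_labels)

@torch.no_grad()
def joint_prediction(class_logits, distill_logits):
    """At test time, the two heads are fused by adding their softmax
    outputs (late fusion of the complementary classifiers)."""
    return (class_logits.softmax(dim=-1) + distill_logits.softmax(dim=-1)).argmax(dim=-1)
```

The joint prediction is what the table's "class+distillation" row refers to: each head alone is a valid classifier, but adding their softmax outputs exploits the complementary information the two tokens carry.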
4.3 Agreement with the teacher & inductive bias?
The architecture of the teacher has an important impact.
Does it inherit existing inductive bias that would facilitate the training?
The proposed distilled model is more correlated with the convnet than with a transformer learned from scratch. As is to be expected, the classifier associated with the distillation embedding is closer to the convnet than the one associated with the class embedding, and conversely the one associated with the class embedding is more similar to a DeiT learned without distillation.
・Disagreement analysis between the convnet, image transformers, and distilled transformers (a reproduction sketch follows below).
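The disagreement analysis is straightforward to reproduce: for each pair of classifiers, measure the fraction of test samples on which their predicted labels differ. Below is a minimal sketch, assuming PyTorch models and a standard DataLoader; the helper name `disagreement_rate` is hypothetical, not from the paper.

```python
import torch

@torch.no_grad()
def disagreement_rate(model_a, model_b, loader, device="cuda"):
    """Fraction of samples on which two classifiers predict different classes.
    Lower values mean the two models make more similar decisions."""
    model_a.eval()
    model_b.eval()
    differ, total = 0, 0
    for images, _ in loader:
        images = images.to(device)
        pred_a = model_a(images).argmax(dim=-1)
        pred_b = model_b(images).argmax(dim=-1)
        differ += (pred_a != pred_b).sum().item()
        total += images.size(0)
    return differ / total
```

Comparing, e.g., `disagreement_rate(distilled_deit, convnet, val_loader)` against `disagreement_rate(distilled_deit, scratch_deit, val_loader)` is how one would quantify the claim that the distilled model is closer to the convnet teacher.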
4.4 Number of epochs
Increasing the number of epochs significantly improves the performance of training with distillation.
With 300 epochs, the proposed distilled network DeiT-B⚗ is already better than DeiT-B. While the performance of the latter saturates with longer schedules, the distilled network clearly benefits from a longer training time.
・The proposed method performs better as the training schedule gets longer.
Discussion