【ML Paper】DeiT: Can ViT Work with Fewer Images? Part 2
This is Part 2 of the summary of this paper.
The authors propose an improved Vision Transformer, DeiT (Data-efficient image Transformer).
Original Paper: https://arxiv.org/abs/2012.12877v2
2. Terms
2.1 Knowledge Distillation
Knowledge distillation is a training technique in which a teacher model's output is used as an auxiliary training signal (loss) for a student model, so the student learns to mimic the teacher's predictions in addition to fitting the ground-truth labels.
・Knowledge Distillation image [1]
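As a concrete reference, here is a minimal sketch of the classic soft-distillation loss (Hinton et al.) in PyTorch. The temperature `T` and mixing weight `alpha` are illustrative hyperparameters, not the values used in the DeiT paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.1):
    """Soft distillation: cross-entropy on ground truth plus KL divergence
    between the temperature-softened teacher and student distributions."""
    # Hard loss: standard cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    # Soft loss: KL divergence between temperature-scaled distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft
```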
2.2 Class Token
The class token is a learnable vector that aggregates information from all the other tokens in the sequence through self-attention with them.
Because the class token attends to every patch token, it is designed specifically to collect a global summary of the input. For tasks like classification, the final representation of the class token is used as the input to a classifier head (for example, a fully connected layer followed by softmax) to make a prediction.
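To make this concrete, here is a minimal sketch in PyTorch of how a class token is prepended to the patch sequence and read out for classification. The module and its dimensions (dim, depth, heads) are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ClassTokenHead(nn.Module):
    """Minimal sketch: prepend a learnable class token to the patch
    sequence, run a Transformer encoder, and classify from that token."""
    def __init__(self, dim=192, num_classes=1000, depth=2, heads=3):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens):  # patch_tokens: (B, N, dim)
        b = patch_tokens.size(0)
        cls = self.cls_token.expand(b, -1, -1)     # (B, 1, dim)
        x = torch.cat([cls, patch_tokens], dim=1)  # (B, N+1, dim)
        x = self.encoder(x)                        # self-attention mixes patch info into cls
        return self.head(x[:, 0])                  # classify from the class token only

# Usage example with dummy patch embeddings
model = ClassTokenHead()
logits = model(torch.randn(4, 196, 192))  # e.g. 14x14 patches -> (4, 1000)
```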