【CV】What Is an Object Detection Model? Part 1
An explanation of object detection, part 1.
1. Object detection
Object detection is the task of determining "what is where" in an image: the class of each object and its location, usually given as a bounding box.
Structure
Many object detection models have a backbone, neck, and head structure.
Backbone (ResNet, EfficientNet, etc.): Feature extraction
Neck (FPN, BiFPN, etc.): Feature transformation
Head: Predicts the position and class of each object based on the neck output
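To make the three-stage structure concrete, here is a minimal NumPy sketch. It is only an illustration: `backbone`, `neck`, and `head` are hypothetical toy functions (average pooling, an identity pass, and a random linear layer), not the layers of any real detector.

```python
import numpy as np

def backbone(image, strides=(8, 16, 32)):
    """Toy feature extractor: average-pool the image at several strides.
    A real backbone (e.g. ResNet) would use learned convolutions."""
    h, w, c = image.shape
    feats = []
    for s in strides:
        # Split H and W into s x s blocks and average each block.
        pooled = image.reshape(h // s, s, w // s, s, c).mean(axis=(1, 3))
        feats.append(pooled)
    return feats

def neck(feats):
    """Toy feature transformation: an identity placeholder.
    A real neck (e.g. FPN) would mix information across scales."""
    return [f.copy() for f in feats]

def head(feats, num_classes=3, seed=0):
    """Toy prediction head: a random linear layer that maps every
    location to 4 box values plus num_classes class scores."""
    rng = np.random.default_rng(seed)
    outputs = []
    for f in feats:
        weight = rng.normal(size=(f.shape[-1], 4 + num_classes))
        pred = f @ weight  # shape (H, W, 4 + num_classes)
        outputs.append({"boxes": pred[..., :4], "class_scores": pred[..., 4:]})
    return outputs

image = np.random.default_rng(1).random((64, 64, 3))
predictions = head(neck(backbone(image)))
shapes = [p["boxes"].shape for p in predictions]
```

Each stage only consumes the output of the previous one, which is why a backbone or neck can be swapped without redesigning the rest of the model.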
Backbone
The foundation of the model, which extracts features from the image. A pre-trained model is usually used.
Famous models:
・ResNet (2015):
AlexNet (2012) was the first deep neural network to win a major international image recognition competition (ILSVRC). It has only 8 layers, and simply stacking many more layers makes training difficult because of problems such as vanishing gradients.
ResNet made much deeper networks trainable by introducing residual (skip) connections.
・EfficientNet (2019):
By jointly scaling network depth, width, and input resolution, it achieves good accuracy with fewer parameters.
・ViT:
Splits the image into small patches, treats each patch as one token, and applies a Transformer to the resulting sequence.
・DeiT:
ViT needed a very large number of training images to exceed CNN performance; DeiT addressed this by using knowledge distillation.
・Swin Transformer:
Introduced patch merging and shifted windows to support various resolutions and reduce the amount of computation.
・MaxViT:
Combines local and global attention to extract better features.
・ConvNeXt:
Improved performance by revisiting the design of conventional CNNs and incorporating design ideas from Transformers.
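Two of the ideas in the list above are easy to sketch in a few lines of NumPy: the residual connection used by ResNet, and the patch splitting used by ViT. This is a simplified illustration; `residual_block` and `image_to_patches` are hypothetical helper names, and the "layer" inside the residual block is a single toy linear map with ReLU rather than the convolutions used in the real model.

```python
import numpy as np

def residual_block(x, weight):
    """Compute y = x + F(x), where F is a toy linear layer plus ReLU."""
    fx = np.maximum(x @ weight, 0.0)
    return x + fx  # the skip connection adds the input back

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.
    Assumes H and W are divisible by patch_size."""
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

# With F(x) = 0 the residual block reduces to the identity mapping.
x = np.ones((2, 4))
y = residual_block(x, np.zeros((4, 4)))

# A 32x32 RGB image with 16x16 patches gives 4 tokens of length 768.
image = np.arange(32 * 32 * 3, dtype=np.float64).reshape(32, 32, 3)
tokens = image_to_patches(image, patch_size=16)
```

Because the block falls back to the identity when its weights contribute nothing, adding more such blocks does not have to hurt the signal passing through, which is the intuition behind stacking them deeply. In ViT, each row of `tokens` would then be linearly projected and fed to the Transformer as one sequence element.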
Neck
・FPN (2017):
Improved multi-scale performance (the ability to detect objects of various sizes) by using skip connections and an encoder-decoder-like structure.
・EfficientDet (2020):
Uses EfficientNet as the backbone and efficiently fuses feature maps of multiple scales with BiFPN.
・DETR (2020):
The first model to use a Transformer as the neck.
・DINO:
A model that addresses the problems of DETR (difficulty in scaling performance, slow convergence) by improving and combining methods from Deformable DETR and DN-DETR.
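As a rough illustration of how an FPN-style neck mixes scales, the sketch below merges feature maps from coarse to fine with nearest-neighbour upsampling and element-wise addition. It rests on several assumptions: `upsample2x` and `fpn_top_down` are hypothetical names, all inputs are assumed to already share the same channel count (the real FPN applies 1x1 lateral convolutions for this), and the 3x3 smoothing convolutions of the real FPN are omitted.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_top_down(feats):
    """Merge feature maps from the coarsest level down to the finest.

    feats is ordered fine to coarse; each map is assumed to have half
    the resolution of the previous one and the same channel count.
    """
    merged = [feats[-1]]  # the coarsest map is passed through unchanged
    for lateral in reversed(feats[:-1]):
        # Skip connection: add the upsampled coarser map to this level.
        merged.append(lateral + upsample2x(merged[-1]))
    return merged[::-1]  # return in fine-to-coarse order again

# Constant maps make it easy to see which levels contributed where.
c3 = np.full((8, 8, 2), 1.0)
c4 = np.full((4, 4, 2), 10.0)
c5 = np.full((2, 2, 2), 100.0)
p3, p4, p5 = fpn_top_down([c3, c4, c5])
```

In this toy run the finest output `p3` contains contributions from all three levels, which is what lets a head attached to a high-resolution map still use coarse, semantically strong features.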
Discussion