🦌

【CV】What is the object detection model? Part1

Published on 2024/10/11

An explanation of object detection, part 1.

1. Object detection

Object detection is the task of determining "what is where": which objects appear in an image and where each one is located.
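
As a rough illustration of "what is where", a single detection can be represented as a class label, a confidence score, and a bounding box. The `Detection` class, its field names, and the example values below are assumptions made for illustration and do not come from any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object: 'what' (label, score) and 'where' (box)."""
    label: str      # predicted class name
    score: float    # confidence in [0, 1]
    x_min: float    # box corners in pixel coordinates
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        """Area of the bounding box in square pixels."""
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)


# Hypothetical model output for one image: two objects, each with a class and a box.
detections = [
    Detection("deer", 0.92, 40, 60, 200, 220),
    Detection("tree", 0.81, 210, 10, 300, 240),
]
for d in detections:
    print(f"{d.label} ({d.score:.2f}) at ({d.x_min}, {d.y_min})-({d.x_max}, {d.y_max})")
```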

Structure

Many object detection models are built from three parts: a backbone, a neck, and a head.
Backbone (ResNet, EfficientNet, etc.): Feature extraction
Neck (FPN, BiFPN, etc.): Feature transformation
Head: Predicts object classes and positions (bounding boxes) from the neck output
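
The data flow through the three parts can be sketched as plain function composition. Everything below is a toy stand-in chosen for illustration (average pooling as the "backbone", a pass-through "neck", and a "head" that reports the strongest cell per scale); real models use learned layers for each part.

```python
from typing import Callable, List, Tuple

FeatureMap = List[List[float]]


def avg_pool_2x2(x: FeatureMap) -> FeatureMap:
    """Halve the resolution by averaging each 2x2 block."""
    return [
        [(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
         for j in range(0, len(x[0]) - 1, 2)]
        for i in range(0, len(x) - 1, 2)
    ]


def toy_backbone(image: FeatureMap) -> List[FeatureMap]:
    """Stand-in backbone: feature maps at 1/2 and 1/4 of the input resolution."""
    c1 = avg_pool_2x2(image)
    c2 = avg_pool_2x2(c1)
    return [c1, c2]


def toy_neck(features: List[FeatureMap]) -> List[FeatureMap]:
    """Stand-in neck: passes features through (a real neck mixes the scales)."""
    return features


def toy_head(features: List[FeatureMap]) -> List[Tuple[int, int, float]]:
    """Stand-in head: for each scale, report (row, col, value) of the strongest cell."""
    results = []
    for fmap in features:
        value, row, col = max(
            (v, r, c) for r, cells in enumerate(fmap) for c, v in enumerate(cells)
        )
        results.append((row, col, value))
    return results


def detect(image: FeatureMap, backbone: Callable, neck: Callable, head: Callable):
    """Backbone -> neck -> head, in that order."""
    return head(neck(backbone(image)))


# A 4x4 image with a bright 2x2 "object" in the bottom-right corner.
image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
print(detect(image, toy_backbone, toy_neck, toy_head))
```

Because each stage only depends on the previous stage's output, a backbone or neck can be swapped without changing the rest, which is why the parts are usually discussed separately.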

Backbone

The backbone is the foundation of the model and extracts features from the input image. A pre-trained model is typically used.

Famous models:
・ResNet (2015):
AlexNet (2012) was the first deep neural network to achieve a top score in a major international image recognition competition. It has 8 layers, and simply adding many more layers makes training difficult because of vanishing gradients.
ResNet succeeded in making networks much deeper by using residual connections.
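
The idea of a residual connection can be sketched in a few lines: the block outputs `f(x) + x` instead of `f(x)`, so the input always has a direct path to the output. The names `residual_block` and `zero_layer` are made up for this illustration, and plain lists stand in for tensors.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """Return f(x) + x, so the layer f only has to learn the residual."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("f must preserve the shape so that f(x) + x is defined")
    return [a + b for a, b in zip(fx, x)]


def zero_layer(x: Vector) -> Vector:
    """A layer whose weights are all zero, so it outputs zeros."""
    return [0.0 for _ in x]


x = [1.0, -2.0, 3.0]
# Even when the layer contributes nothing, the input passes through unchanged.
print(residual_block(x, zero_layer))
```

Since the identity path does not depend on the layer's weights, the signal (and its gradient) can skip over a layer that has not yet learned anything useful.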
・EfficientNet (2019):
By scaling the network's depth, width, and input resolution together in a balanced way, it achieves good results with fewer parameters.
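
The scaling idea can be made concrete with a small calculation. The sketch below assumes the base coefficients reported in the EfficientNet paper (1.2 for depth, 1.1 for width, 1.15 for resolution) and a single exponent `phi`; the function names are illustrative, and released models round these multipliers in practice.

```python
# Base multipliers for depth, width, and resolution, as reported in the EfficientNet paper.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scaling(phi: int) -> dict:
    """Multipliers for depth, width, and input resolution at scaling exponent phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_multiplier(phi: int) -> float:
    """Approximate compute growth: linear in depth, quadratic in width and resolution."""
    s = compound_scaling(phi)
    return s["depth"] * s["width"] ** 2 * s["resolution"] ** 2


for phi in range(4):
    s = compound_scaling(phi)
    print(phi, {k: round(v, 3) for k, v in s.items()}, round(flops_multiplier(phi), 2))
```

With these coefficients, each step of `phi` roughly doubles the compute, which is how one knob trades accuracy against cost.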
・ViT:
Splits the image into small patches, treats each patch as one token, and applies a Transformer to the resulting sequence.
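
The patch step can be sketched as follows. This is a minimal illustration on a single-channel image stored as nested lists; the function name `image_to_patches` is an assumption, and the learned linear projection and position embeddings that a real ViT applies afterwards are omitted.

```python
from typing import List

Image = List[List[float]]


def image_to_patches(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping patch x patch tiles, each flattened."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


# A 4x4 image whose pixel values are 0..15, split into 2x2 patches.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
tokens = image_to_patches(image, patch=2)
print(len(tokens), tokens[0])
```

Each flattened patch plays the role that a word token plays in a language model, so a 224x224 image with 16x16 patches becomes a sequence of 196 tokens.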
・DeiT:
ViT needed a very large number of training images to exceed the performance of CNNs; DeiT solved this by using knowledge distillation.
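
DeiT's distillation can be sketched roughly as a loss function, assuming the hard-label variant in which a separate distillation output is trained to match the teacher's predicted class. The function and argument names are made up for illustration, and a real implementation would work on batched tensors.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])


def hard_distillation_loss(
    cls_logits: List[float],      # student output trained on the true label
    dist_logits: List[float],     # student output trained on the teacher's decision
    teacher_logits: List[float],  # output of the (frozen) teacher
    label: int,
) -> float:
    """Average of the usual label loss and a loss against the teacher's hard prediction."""
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    return 0.5 * cross_entropy(cls_logits, label) + 0.5 * cross_entropy(dist_logits, teacher_label)


loss = hard_distillation_loss(
    cls_logits=[2.0, 0.5, 0.1],
    dist_logits=[0.3, 1.5, 0.2],
    teacher_logits=[0.1, 3.0, 0.4],
    label=0,
)
print(round(loss, 4))
```

The second term gives the student an extra training signal from the teacher on every image, which is what reduces the amount of labeled data needed.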

・Swin Transformer:
Introduced patch merging and shifted windows to support various resolutions and reduce the amount of computation.
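
The shifted-window idea can be sketched on a small grid of patch indices. This is a simplified illustration: attention is computed only inside each window, and shifting the grid between layers lets neighboring windows exchange information. The function names are assumptions, and the attention masking that the real model applies to wrapped-around cells is omitted.

```python
from typing import List

Grid = List[List[int]]


def window_partition(grid: Grid, m: int) -> List[List[int]]:
    """Split the grid into non-overlapping m x m windows, each flattened."""
    h, w = len(grid), len(grid[0])
    if h % m or w % m:
        raise ValueError("grid size must be divisible by the window size")
    return [
        [grid[r][c] for r in range(top, top + m) for c in range(left, left + m)]
        for top in range(0, h, m)
        for left in range(0, w, m)
    ]


def cyclic_shift(grid: Grid, shift: int) -> Grid:
    """Roll the grid up and left by `shift` cells, wrapping around the borders."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + shift) % h][(c + shift) % w] for c in range(w)] for r in range(h)]


# A 4x4 grid of patch indices 0..15 with 2x2 windows.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
regular = window_partition(grid, 2)
shifted = window_partition(cyclic_shift(grid, 1), 2)
print(regular[0])  # patches grouped in one layer
print(shifted[0])  # patches grouped in the next layer
```

In the shifted layer, the first window contains one patch from each of the four regular windows, so information crosses window borders while each attention call still sees only 4 patches.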

・MaxViT:
Combines local and global attention to extract better features.

・ConvNeXt:
Improves performance by revisiting the design of conventional CNNs and incorporating the advantages of Transformers.

Neck

・FPN (2017):
Improves multi-scale performance (the ability to detect objects of various sizes) by using skip connections and an encoder-decoder structure.
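
One step of this multi-scale mixing can be sketched as "upsample the coarse map and add it to the finer one". The code below is a simplified illustration on nested lists with made-up function names; the convolutions that FPN applies before and after the addition are omitted.

```python
from typing import List

FeatureMap = List[List[float]]


def upsample_nearest_2x(x: FeatureMap) -> FeatureMap:
    """Double height and width by repeating every cell (nearest-neighbor upsampling)."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def top_down_merge(fine: FeatureMap, coarse: FeatureMap) -> FeatureMap:
    """Add the upsampled coarse map to the fine map through a skip connection."""
    up = upsample_nearest_2x(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("fine map must be exactly twice the size of the coarse map")
    return [[a + b for a, b in zip(fine_row, up_row)] for fine_row, up_row in zip(fine, up)]


fine = [[1.0, 2.0], [3.0, 4.0]]   # high resolution, local detail
coarse = [[10.0]]                 # low resolution, broader context
print(top_down_merge(fine, coarse))
```

Repeating this step from the coarsest level downward gives every resolution access to context from the deeper layers, which helps with both small and large objects.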

・EfficientDet (2020):
Uses EfficientNet as the backbone and effectively mixes feature maps of multiple scales with BiFPN.
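
How BiFPN mixes feature maps can be sketched with the weighted fusion rule from the paper that introduced it: each input map gets a learnable non-negative weight, and the weights are normalized by their sum. The code below assumes inputs that were already resized to the same shape; the function name and the example weights are illustrative.

```python
from typing import List

FeatureMap = List[List[float]]


def fast_normalized_fusion(
    inputs: List[FeatureMap], weights: List[float], eps: float = 1e-4
) -> FeatureMap:
    """Weighted average of same-sized feature maps with non-negative weights."""
    if len(inputs) != len(weights):
        raise ValueError("need exactly one weight per input feature map")
    w = [max(0.0, x) for x in weights]   # ReLU keeps every weight non-negative
    total = sum(w) + eps                 # eps avoids division by zero
    rows, cols = len(inputs[0]), len(inputs[0][0])
    return [
        [sum(wi * fmap[r][c] for wi, fmap in zip(w, inputs)) / total for c in range(cols)]
        for r in range(rows)
    ]


a = [[2.0, 2.0]]
b = [[4.0, 8.0]]
print(fast_normalized_fusion([a, b], weights=[1.0, 1.0]))  # close to a plain average
print(fast_normalized_fusion([a, b], weights=[1.0, 3.0]))  # leans toward b
```

Learning the weights lets the network decide how much each scale should contribute, instead of treating all inputs as equally important.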

・DETR (2020):
The first model to use a Transformer as the neck.

・DINO:
A model that addresses the problems of DETR (performance that is hard to scale and slow convergence). It improves on and combines the methods of Deformable DETR and DN-DETR.
