🐶

【ML Paper】Explanation of the Entire YOLO Series, Part 11

Published 2024/11/28

This is a summary of a paper that explains YOLOv1 through v8 in a single survey.
Let's look at the history of YOLO through this paper.
This article is Part 11; Part 10 is here.

Original Paper: https://arxiv.org/pdf/2304.00501

6.1 YOLOv3 Architecture

The backbone architecture of YOLOv3, known as Darknet-53, replaces all max-pooling layers with strided convolutions and incorporates residual connections.
This design comprises a total of 53 convolutional layers, each equipped with batch normalization and Leaky ReLU activation functions.
Residual connections link the inputs of the 1 × 1 convolutions to the outputs of the 3 × 3 convolutions throughout the network, enhancing feature propagation and gradient flow.
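The residual pattern described above can be sketched in plain NumPy. This is a minimal illustration, not Darknet's implementation: batch normalization is omitted, the weights are random, and the leaky-slope value of 0.1 is the one commonly used in Darknet. The key property shown is that the 1 × 1 convolution halves the channels, the 3 × 3 convolution restores them, and the block input is added back, so the block preserves shape and can be stacked.

```python
import numpy as np

def conv1x1(x, w):
    # Pointwise convolution: x is (C_in, H, W), w is (C_out, C_in).
    c, h, wd = x.shape
    return (w @ x.reshape(c, h * wd)).reshape(w.shape[0], h, wd)

def conv3x3(x, w):
    # 3x3 convolution with zero padding 1: x is (C_in, H, W), w is (C_out, C_in, 3, 3).
    c_in, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(3):
        for j in range(3):
            patch = xp[:, i:i + h, j:j + wd]               # shifted (C_in, H, W) window
            out += np.einsum('oc,chw->ohw', w[:, :, i, j], patch)
    return out

def leaky_relu(x, slope=0.1):
    return np.where(x > 0, x, slope * x)

def residual_block(x, w1, w3):
    # Darknet-53 pattern: 1x1 conv halves channels, 3x3 conv restores them,
    # then the block input is added back (batch norm omitted for brevity).
    y = leaky_relu(conv1x1(x, w1))
    y = leaky_relu(conv3x3(y, w3))
    return x + y

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 13, 13))
w1 = rng.standard_normal((32, 64)) * 0.1        # 64 -> 32 channels
w3 = rng.standard_normal((64, 32, 3, 3)) * 0.1  # 32 -> 64 channels
out = residual_block(x, w1, w3)
print(out.shape)  # (64, 13, 13): input and output shapes match
```

Because the output shape matches the input, these blocks stack freely, which is how Darknet-53 reaches its 53 convolutional layers.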

Darknet-53 achieves Top-1 and Top-5 accuracies comparable to ResNet-152 while operating nearly twice as fast, making it a highly efficient feature extractor.
Figure 9 illustrates the detailed architecture of Darknet-53, highlighting that it solely represents the backbone and does not include the detection head responsible for multi-scale predictions.

6.2 YOLOv3 Multi-Scale Predictions

A pivotal enhancement in YOLOv3 is its multi-scale prediction capability, which involves making predictions at multiple grid sizes.
This approach addresses the limitations of previous YOLO versions by enabling the detection of smaller objects and producing more precise bounding boxes.

The multi-scale detection architecture operates by generating three separate outputs at different scales, as depicted in Figure 10. The first output, $y_1$, corresponds to a 13 × 13 grid similar to YOLOv2.
The second output, $y_2$, is formed by concatenating feature maps from different stages of Darknet-53, specifically after (Res × 4) and (Res × 8), with an upsampling operation applied to align the feature map sizes to 26 × 26.
The third output, $y_3$, further concatenates the 26 × 26 feature maps with those at 52 × 52 after another upsampling step.

For the COCO dataset, which includes 80 categories, each scale generates an output tensor shaped $N \times N \times [3 \times (4 + 1 + 80)]$, where $N \times N$ is the grid size, 3 is the number of bounding boxes per cell, and $4 + 1 + 80$ accounts for the four bounding box coordinates, the objectness score, and the class probabilities, respectively.
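The shape formula above is easy to verify directly. A small sketch, assuming the common 416 × 416 input resolution that yields the 13/26/52 grid sizes; the helper function name is my own:

```python
# Per-scale YOLOv3 output tensor shape: N x N x [B * (4 + 1 + num_classes)],
# with B = 3 anchor boxes per grid cell.
def yolo_output_shape(grid_size, num_anchors=3, num_classes=80):
    # 4 box coordinates + 1 objectness score + per-class probabilities
    channels = num_anchors * (4 + 1 + num_classes)
    return (grid_size, grid_size, channels)

# The three prediction scales for a 416x416 input (COCO: 80 classes)
for n in (13, 26, 52):
    print(yolo_output_shape(n))
# (13, 13, 255)
# (26, 26, 255)
# (52, 52, 255)
```

Note that the channel count is the same 255 at every scale; only the spatial grid changes, which is what lets the finer grids pick up smaller objects.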

This multi-scale strategy significantly enhances YOLOv3's ability to accurately detect objects of varying sizes across different spatial resolutions.

Discussion