【ML Paper】Explanation of all of YOLO series Part 10

Published on 2024/11/28

This is a summary of a paper that explains YOLOv1 through YOLOv8 in one place.

Let's trace the history of YOLO through this paper.

This article is Part 10; Part 9 is here.
Original Paper: https://arxiv.org/pdf/2304.00501

 6. YOLOv3
 6.0.1 Bounding Box Prediction
Like YOLOv2, YOLOv3 predicts four coordinates (t_x, t_y, t_w, t_h) for each bounding box. In addition, it predicts an objectness score for each bounding box using logistic regression. The target for this score is 1 for the anchor box that overlaps the ground truth the most and 0 for the remaining anchor boxes. Unlike Faster R-CNN, YOLOv3 assigns only one anchor box to each ground truth object. If no anchor box is assigned to an object, only classification loss is incurred, with no localization or confidence loss.
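The prediction step above can be sketched in a few lines. This is a minimal illustration, assuming the decoding used in the YOLOv2 and YOLOv3 papers (sigmoid offsets inside a grid cell, exponential scaling of a prior). The names `decode_box`, `assign_objectness`, `c_x`, `c_y`, `p_w`, and `p_h` are chosen here for illustration and do not come from the article.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Map raw outputs to a box centre and size, in grid-cell units.

    (c_x, c_y) is the top-left corner of the grid cell and (p_w, p_h)
    is the size of the anchor prior; both are assumed inputs.
    """
    b_x = sigmoid(t_x) + c_x   # centre stays inside the cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)  # prior width scaled by exp(t_w)
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


def assign_objectness(ious):
    """Objectness targets: 1 for the anchor with the highest IoU, else 0."""
    best = max(range(len(ious)), key=ious.__getitem__)
    return [1 if i == best else 0 for i in range(len(ious))]
```

For example, `decode_box(0, 0, 0, 0, c_x=3, c_y=4, p_w=2.0, p_h=5.0)` gives `(3.5, 4.5, 2.0, 5.0)`: zero offsets place the centre in the middle of the cell and leave the prior size unchanged.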

 6.0.2 Class Prediction
Instead of a softmax over classes, YOLOv3 trains independent logistic classifiers with binary cross-entropy, framing the problem as a multilabel classification task. This change allows multiple labels to be assigned to the same bounding box, which suits complex datasets with overlapping labels. For instance, a single object can be classified simultaneously as both a "Person" and a "Man."
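To make the difference from softmax concrete, here is a small sketch of per-class sigmoid outputs with binary cross-entropy. The function names, the 0.5 threshold, and the example class list are assumptions for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over independent per-class sigmoids."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)


def predict_labels(logits, class_names, threshold=0.5):
    """Keep every class whose independent probability exceeds the threshold."""
    return [name for z, name in zip(logits, class_names)
            if sigmoid(z) > threshold]
```

Because each class is scored on its own, `predict_labels([3.0, 2.0, -4.0], ["Person", "Man", "Car"])` keeps both "Person" and "Man", which a softmax (whose outputs sum to 1) cannot express at this threshold.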

 6.0.3 New Backbone
YOLOv3 introduces a more robust backbone composed of 53 convolutional layers integrated with residual connections. This enhanced feature extractor aligns YOLOv3 with state-of-the-art performance benchmarks while preserving real-time processing capabilities. Detailed architecture specifications are further elaborated in Section 6.1.

 6.0.4 Spatial Pyramid Pooling (SPP)
Although not initially highlighted in the original paper, YOLOv3 incorporates a modified Spatial Pyramid Pooling (SPP) block within its backbone. This block concatenates multiple max pooling outputs with varying kernel sizes (k × k, where k = 1, 5, 9, 13) without subsampling (stride = 1), thereby increasing the receptive field. The variant known as YOLOv3-spp demonstrated superior performance, enhancing the Average Precision at 50% Intersection over Union (AP50) by 2.7%.
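The SPP idea described above can be sketched in plain Python on nested lists. This is an illustrative toy, not the Darknet implementation: `max_pool_same` and `spp_block` are hypothetical names, and border handling (clipping the window, equivalent to padding with negative infinity) is an assumption.

```python
def max_pool_same(fmap, k):
    """k x k max pooling with stride 1 that keeps the spatial size.

    Windows are clipped at the border, which matches padding with -inf.
    """
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [fmap[a][b]
                      for a in range(max(0, i - r), min(h, i + r + 1))
                      for b in range(max(0, j - r), min(w, j + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def spp_block(channels, kernel_sizes=(1, 5, 9, 13)):
    """Concatenate the pooled copies of every channel along the channel axis."""
    out = []
    for k in kernel_sizes:
        out.extend(max_pool_same(c, k) for c in channels)
    return out
```

With four kernel sizes the block returns four times as many channels at the same resolution, and the k = 1 branch is just the unchanged input.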

 6.0.5 Multi-scale Predictions
Inspired by Feature Pyramid Networks, YOLOv3 implements multi-scale predictions by predicting three bounding boxes at each of three different scales. This approach improves detection accuracy across objects of varying sizes.
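As a rough check of what three boxes at three scales means for the output tensors, the sketch below computes the grid shapes. The 416 × 416 input, the strides 32, 16, and 8, and the 80 COCO classes are assumptions taken from the commonly used YOLOv3 configuration, not from this article.

```python
def yolo_v3_output_shapes(input_size=416, num_classes=80,
                          boxes_per_scale=3, strides=(32, 16, 8)):
    """Return (grid_h, grid_w, channels) for each detection scale."""
    # Each box carries 4 coordinates, 1 objectness score, and class scores.
    channels = boxes_per_scale * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, channels) for s in strides]


def total_boxes(shapes, boxes_per_scale=3):
    """Number of candidate boxes predicted over all scales."""
    return sum(h * w * boxes_per_scale for h, w, _ in shapes)
```

Under these assumptions the shapes are 13 × 13 × 255, 26 × 26 × 255, and 52 × 52 × 255, for 10647 candidate boxes per image. The coarse grid is suited to large objects and the fine grid to small ones.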

 6.0.6 Bounding Box Priors
YOLOv3 employs k-means clustering to determine the bounding box priors for anchor boxes. Unlike YOLOv2, which uses five prior boxes per cell, YOLOv3 uses three prior boxes at each of three distinct scales, for nine priors in total. Tying priors to scales lets each scale specialize in objects of a particular size range.
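A minimal sketch of clustering box sizes into priors follows. It assumes the 1 − IoU distance introduced in the YOLOv2 paper, a mean update for centroids, and a deterministic initialisation picked here for reproducibility; `iou_wh` and `kmeans_anchors` are hypothetical names.

```python
def iou_wh(a, b):
    """IoU of two (width, height) boxes that share the same corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iterations=50):
    """Cluster (w, h) pairs with distance 1 - IoU; priors sorted by area."""
    boxes = sorted(boxes, key=lambda b: b[0] * b[1])
    # Deterministic start: k boxes evenly spaced through the sorted list.
    step = len(boxes) / k
    centroids = [boxes[int(step * i + step / 2)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest 1 - IoU distance.
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        updated = []
        for i, members in enumerate(clusters):
            if members:
                updated.append((sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members)))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centre
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])
```

For YOLOv3 one would call this with `k=9` on the training set's box sizes and split the area-sorted result into three groups of three, one group per scale.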

 6.0.7 Architecture Overview
Table 2 presents the YOLOv2 architecture, detailing the Darknet-19 backbone and the detection head. The backbone comprises layers 1 to 23, while the detection head includes the final four convolutional layers and a passthrough layer that reorganizes features for concatenation and final prediction. The final convolution produces a 13 × 13 grid with 125 channels, which holds 25 predictions (4 box coordinates, 1 objectness score, and 20 class scores) for each of five bounding boxes.
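The channel arithmetic in this description can be verified with a short sketch. The function names are hypothetical, and the 26 × 26 × 512 passthrough input in the usage note is taken from the YOLOv2 paper rather than from this article.

```python
def yolo_v2_head_channels(num_anchors=5, num_classes=20):
    """Output channels of the final convolution in the YOLOv2 head."""
    per_box = 4 + 1 + num_classes  # coordinates + objectness + classes
    return num_anchors * per_box


def passthrough_shape(height, width, channels, stride=2):
    """Shape after a space-to-depth reorganisation with the given stride."""
    return height // stride, width // stride, channels * stride * stride
```

`yolo_v2_head_channels()` gives 125, and `passthrough_shape(26, 26, 512)` gives `(13, 13, 2048)`, so the reorganized features match the 13 × 13 grid and can be concatenated with it.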
