🦏

【ML Paper】YOLOv2: part4

Published on 2024/11/03

This time, I'll introduce YOLOv2, following the paper by Joseph Redmon and Ali Farhadi. Let's focus on how it differs from YOLOv1.

This article is part 4. Part 3 is here.

Original Paper: https://arxiv.org/abs/1612.08242

3.5 Convolutional With Anchor Boxes

YOLOv2 removes the fully connected layers from YOLO and uses anchor boxes to predict bounding boxes instead.
This approach aligns YOLO more closely with the methodology of Faster R-CNN, which utilizes a region proposal network (RPN) to predict offsets and confidences for predefined anchor boxes through convolutional layers.

To raise the resolution of the convolutional feature maps, one pooling layer was eliminated, and the network was shrunk to accept input images of 416×416 pixels instead of the original 448×448. This gives the feature map an odd number of locations and therefore a single center cell. Objects, especially large ones, tend to occupy the center of the image, so it is better to have one cell at the center responsible for them than four neighboring cells.
The convolutional layers in YOLO downsample the input image by a factor of 32, so a 416×416 input produces a 13×13 output feature map (416 / 32 = 13). In addition, anchor boxes decouple class prediction from spatial location: the model predicts class probabilities and an objectness score for each anchor box individually rather than once per grid cell. The objectness score predicts the Intersection over Union (IOU) between the ground truth and the proposed box, while the class predictions are the conditional probability of each class given that an object is present.
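
The grid arithmetic above can be checked with a short sketch. The function names are hypothetical, and the anchor count and class count at the bottom are illustrative assumptions (this section of the article does not state them); only the input sizes and the downsampling factor of 32 come from the text.

```python
def grid_size(input_size: int, stride: int = 32) -> int:
    """Side length of the output feature map for a square input image."""
    if input_size % stride != 0:
        raise ValueError("input_size must be a multiple of stride")
    return input_size // stride


def has_single_center_cell(side: int) -> bool:
    """An odd grid has exactly one center cell; an even grid has four."""
    return side % 2 == 1


def channels_per_cell(num_anchors: int, num_classes: int) -> int:
    """Each anchor predicts 4 box values, 1 objectness score, and class scores."""
    return num_anchors * (4 + 1 + num_classes)


for size in (448, 416):
    side = grid_size(size)
    print(f"{size}x{size} input -> {side}x{side} grid, "
          f"single center cell: {has_single_center_cell(side)}")

# Assumed values for illustration only: 5 anchors per cell, 20 classes.
NUM_ANCHORS = 5
NUM_CLASSES = 20
print("output channels per cell:", channels_per_cell(NUM_ANCHORS, NUM_CLASSES))
```

With a 448×448 input the grid would be 14×14, which has no single center cell. That is the reason for moving to 416×416.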

3.5.1 Results

Implementing anchor boxes resulted in a minor decrease in mean Average Precision (mAP) from 69.5 to 69.2. However, this modification significantly improved the recall from 81% to 88%.
The number of predicted bounding boxes per image grows from 98 in the original YOLO to more than a thousand with anchor boxes. Even though mAP drops slightly, the higher recall means the model misses fewer objects and therefore has more room to improve.
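
The box counts can be reproduced with simple arithmetic. This is a minimal sketch: the 7×7 grid with 2 boxes per cell is the original YOLO configuration, while the number of anchors per cell is not stated in this section, so the value 9 (the default used by Faster R-CNN) is an assumption made only for illustration.

```python
def total_boxes(grid_side: int, boxes_per_cell: int) -> int:
    """Number of boxes predicted for one image on a square grid."""
    return grid_side * grid_side * boxes_per_cell


# Original YOLO: 7x7 grid, 2 boxes per cell.
yolov1_boxes = total_boxes(7, 2)

# Anchor-box model: 13x13 grid. ASSUMED_ANCHORS is an assumption, not from the text.
ASSUMED_ANCHORS = 9
anchor_boxes = total_boxes(13, ASSUMED_ANCHORS)

# Smallest anchor count per cell that yields more than a thousand boxes on 13x13.
min_anchors = next(k for k in range(1, 100) if total_boxes(13, k) > 1000)

print("YOLO boxes per image:", yolov1_boxes)
print("anchor-box model boxes per image (assumed 9 anchors):", anchor_boxes)
print("minimum anchors per cell for more than 1000 boxes:", min_anchors)
```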

3.5.2 Discussion

Integrating anchor boxes into the YOLO framework trades a small amount of mAP for a large gain in recall.
mAP drops by only 0.3 points (69.5 to 69.2), while recall rises by 7 points (81% to 88%), which suggests that the model misses fewer objects once it proposes many more candidate boxes.

The higher recall leaves more room for later refinements to improve the model's overall detection performance.
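
To make the link between box count and recall concrete, here is a toy sketch of IOU and recall. It is not the evaluation code used in the paper: the box format, the 0.5 IOU threshold, the helper names, and the example boxes are all assumptions, and the matching is simplified (a ground-truth box counts as found if any prediction overlaps it enough).

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall(ground_truths: List[Box], predictions: List[Box],
           threshold: float = 0.5) -> float:
    """Fraction of ground-truth boxes matched by at least one prediction."""
    if not ground_truths:
        return 0.0
    found = sum(
        1 for gt in ground_truths
        if any(iou(gt, p) >= threshold for p in predictions)
    )
    return found / len(ground_truths)


# Made-up example: two objects, first with one prediction, then with two.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
few_preds = [(1, 1, 10, 10)]
more_preds = few_preds + [(20, 20, 29, 31)]

print("recall with few boxes:", recall(gts, few_preds))
print("recall with more boxes:", recall(gts, more_preds))
```

In this toy case the extra box covers the second object, so recall rises from 0.5 to 1.0. Proposing more candidates can only keep or raise recall, although it may add false positives that lower precision.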
