
【ML Paper】Explanation of all of YOLO series Part 9

Published on 2024/11/25

This is a summary of a paper that explains YOLOv1 through v8 in one place.
Let's trace the history of YOLO through this paper.
This article is part 9; part 8 is here.

Original Paper: https://arxiv.org/pdf/2304.00501

5. YOLOv2: Better, Faster, and Stronger!

YOLOv2 introduced several enhancements over the original YOLO to improve performance while maintaining speed. Batch normalization was applied to all convolutional layers, facilitating better convergence and acting as a regularizer to reduce overfitting. A high-resolution classifier was implemented by pre-training the model with ImageNet at 224 \times 224 resolution and subsequently fine-tuning it for ten epochs at 448 \times 448, enhancing the network's performance on higher-resolution inputs.
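As a quick illustration of what batch normalization does to a convolutional feature map, here is a minimal NumPy sketch. The NCHW layout, toy shapes, and scalar gamma/beta are assumptions for illustration, not details from the paper:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a batch of conv feature maps per channel (NCHW layout)."""
    # Statistics are computed over the batch and spatial dims,
    # giving one mean/variance per channel.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Toy feature map: batch of 4, 16 channels, 8x8 spatial grid,
# deliberately shifted and scaled away from zero mean / unit variance.
x = np.random.randn(4, 16, 8, 8) * 3.0 + 2.0
y = batch_norm(x, gamma=1.0, beta=0.0)
```

Keeping each layer's activations normalized like this is what stabilizes convergence and adds the mild regularization effect the paper describes.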

The architecture was transformed into a fully convolutional network by removing dense layers, allowing for more flexible input sizes. Anchor boxes were utilized to predict bounding boxes, employing predefined shapes to match prototypical object shapes. Multiple anchor boxes were defined for each grid cell, with the network predicting the coordinates and class for every anchor box. Dimension clusters were determined using k-means clustering on the training bounding boxes, resulting in five prior boxes that provided a balanced tradeoff between recall and model complexity.
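The dimension-cluster idea can be sketched in a few lines of NumPy. The key point is the distance metric: YOLOv2 clusters box (width, height) pairs with d = 1 - IoU rather than Euclidean distance, so large boxes do not dominate. The dataset here is synthetic and the helper names are my own:

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) pairs, assuming boxes share the same center."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centroids[:, 0] * centroids[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)

def kmeans_anchors(boxes, k=5, iters=50, seed=0):
    """k-means on box dimensions with a 1 - IoU distance."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the centroid it overlaps most (min 1 - IoU).
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        centroids = np.array([boxes[assign == i].mean(axis=0)
                              if np.any(assign == i) else centroids[i]
                              for i in range(k)])
    return centroids

# Synthetic (width, height) pairs in pixels, standing in for training boxes.
boxes = np.abs(np.random.default_rng(1).normal(100, 40, size=(200, 2)))
anchors = kmeans_anchors(boxes, k=5)
```

The five resulting centroids play the role of the prior boxes: k = 5 was the paper's chosen tradeoff between recall and model complexity.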

Direct location prediction was adopted, where the network predicted location coordinates relative to the grid cell. Specifically, the network outputs five bounding boxes per cell, each with five values: t_x, t_y, t_w, t_h, and t_o. These values correspond to the bounding box coordinates and objectness score, with the final bounding box coordinates derived as illustrated in Figure 8.
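The decoding step shown in Figure 8 can be written out directly. The sigmoid on t_x and t_y constrains the predicted center to stay inside its grid cell, which is what makes the prediction "direct". Variable names follow the paper's notation; the example values are made up:

```python
import math

def decode_box(tx, ty, tw, th, to, cx, cy, pw, ph):
    """Decode raw YOLOv2 outputs into a bounding box (cell coordinates).

    (cx, cy) is the top-left corner of the grid cell and (pw, ph) is the
    size of the anchor prior assigned to this prediction.
    """
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = sigmoid(tx) + cx          # center x, confined to the cell
    by = sigmoid(ty) + cy          # center y, confined to the cell
    bw = pw * math.exp(tw)         # width, scaled from the prior
    bh = ph * math.exp(th)         # height, scaled from the prior
    objectness = sigmoid(to)       # confidence that a box is present
    return bx, by, bw, bh, objectness

# With all raw outputs at zero, the box sits at the center of cell (5, 4)
# with exactly the prior's 3x2 size and objectness 0.5.
box = decode_box(0, 0, 0, 0, 0, cx=5, cy=4, pw=3, ph=2)
```

Since the network outputs five such boxes per cell and each box carries five values, a 13 x 13 grid yields 13 x 13 x 5 predictions.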

Finer-grained features were achieved by removing one pooling layer, resulting in a feature map of 13 \times 13 for input images of 416 \times 416. Additionally, a passthrough layer was introduced that takes the 26 \times 26 \times 512 feature map and reorganizes it by stacking adjacent spatial features into different channels, instead of losing them through spatial subsampling, then concatenates the result with the deeper 13 \times 13 features. This process generated a concatenated feature map of 13 \times 13 \times 3072. Multi-scale training was employed by varying the input size randomly between 320 \times 320 and 608 \times 608 every ten batches, enhancing the model's robustness to different input resolutions.
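The passthrough layer is essentially a space-to-depth rearrangement. A minimal NumPy sketch shows how the channel counts in the text line up; the HWC layout and the assumed 1024-channel deep feature map are illustrative choices (2048 + 1024 gives the stated 3072 channels):

```python
import numpy as np

def passthrough(x, stride=2):
    """Space-to-depth: fold each stride x stride spatial block into channels."""
    h, w, c = x.shape
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group the 2x2 neighbors together
    return x.reshape(h // stride, w // stride, c * stride * stride)

fine = np.random.randn(26, 26, 512)     # finer-grained features
coarse = np.random.randn(13, 13, 1024)  # deep 13x13 features (assumed depth)

# 26x26x512 -> 13x13x2048, concatenated with 13x13x1024 -> 13x13x3072.
merged = np.concatenate([passthrough(fine), coarse], axis=-1)
```

No information is discarded: every value of the 26 \times 26 map survives, just relocated into extra channels at the coarser resolution.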

Performance

These improvements enabled YOLOv2 to achieve an average precision (AP) of 78.6% on the PASCAL VOC2007 dataset, a significant increase compared to YOLOv1's AP of 63.4%.

YOLOv2 successfully enhanced the original YOLO framework by incorporating batch normalization, high-resolution classifiers, a fully convolutional architecture, anchor boxes, dimension clusters, direct location prediction, finer-grained features, and multi-scale training. These advancements collectively contributed to YOLOv2's superior performance in object detection tasks, and the jointly trained YOLO9000 variant demonstrated the ability to detect more than 9000 categories efficiently.

Discussion