
【ML Paper】YOLOv2: part7

Published on 2024/11/08

This time, I'll introduce YOLOv2 through the paper by Joseph Redmon and Ali Farhadi. Let's focus on how it differs from YOLOv1.

This article is part 7. Part 6 is here.

Original Paper: https://arxiv.org/abs/1612.08242

Multi-Scale Training

YOLOv2 becomes robust to varying image sizes through multi-scale training. The original YOLO uses a fixed input resolution of 448 × 448 pixels; with the addition of anchor boxes, YOLOv2 changes the input size to 416 × 416 pixels.
Because the model uses only convolutional and pooling layers, it can be resized on the fly during training. Every 10 batches, the network randomly selects a new image dimension. Since the model downsamples its input by a factor of 32, the candidate sizes are the multiples of 32 from 320 × 320 to 608 × 608 pixels. This forces the network to learn to predict well across different input sizes, so the same trained model can run detection at various resolutions.
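The size-selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' training code: the names `candidate_sizes` and `size_schedule`, the fixed seed, and the choice to sample the first size randomly are all assumptions made for this example.

```python
import random

STRIDE = 32           # the network downsamples its input by a factor of 32
MIN_SIZE = 320        # smallest training resolution
MAX_SIZE = 608        # largest training resolution
RESIZE_INTERVAL = 10  # number of batches between size changes


def candidate_sizes(min_size=MIN_SIZE, max_size=MAX_SIZE, stride=STRIDE):
    """Return every multiple of `stride` from min_size to max_size inclusive."""
    return list(range(min_size, max_size + 1, stride))


def size_schedule(num_batches, interval=RESIZE_INTERVAL, seed=0):
    """Yield one square input size per batch, resampled every `interval` batches.

    Hypothetical helper: the seed and the random initial size are
    assumptions of this sketch, not details taken from the paper.
    """
    rng = random.Random(seed)
    sizes = candidate_sizes()
    current = rng.choice(sizes)
    for batch_idx in range(num_batches):
        if batch_idx > 0 and batch_idx % interval == 0:
            current = rng.choice(sizes)
        yield current


if __name__ == "__main__":
    for batch_idx, size in enumerate(size_schedule(30)):
        if batch_idx % RESIZE_INTERVAL == 0:
            grid = size // STRIDE  # side length of the output feature map
            print(f"batch {batch_idx:3d}: input {size}x{size}, grid {grid}x{grid}")
```

In a real training loop, each batch of images (and its box labels) would be resized to the yielded size before the forward pass. Because the stride is 32, a 416 × 416 input gives a 13 × 13 output grid and a 608 × 608 input gives a 19 × 19 grid.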

At lower resolutions, such as 288 × 288 pixels, YOLOv2 runs at over 90 frames per second (FPS) with a mean Average Precision (mAP) almost as good as Fast R-CNN, making it suitable for high-framerate video or deployment on smaller GPUs.
At higher resolutions, YOLOv2 reaches 78.6 mAP on the PASCAL VOC 2007 dataset, state of the art at the time of the paper, while still running above real-time speed. This flexibility lets users trade speed against accuracy according to their needs.

Further Experiments

YOLOv2's performance was further tested on additional datasets. On the PASCAL VOC 2012 detection benchmark, YOLOv2 achieved 73.4 mAP while running far faster than competing methods. On the COCO dataset, YOLOv2 reached 44.0 mAP at an Intersection over Union (IOU) threshold of 0.5, comparable to SSD and Faster R-CNN.
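To make the IOU threshold concrete, here is a minimal sketch of how IOU is computed for two axis-aligned boxes and how a 0.5 threshold decides whether a prediction counts as a match. The `(x_min, y_min, x_max, y_max)` box format, the function names, and treating a value exactly at the threshold as a match are assumptions made for this example, not details of any particular evaluation toolkit.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """Hypothetical helper: a prediction matches if its IOU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    ground_truth = (0, 0, 4, 4)
    print(iou((0, 0, 4, 2), ground_truth), is_match((0, 0, 4, 2), ground_truth))
    print(iou((3, 3, 6, 6), ground_truth), is_match((3, 3, 6, 6), ground_truth))
```

A full mAP evaluation additionally ranks predictions by confidence and averages precision over recall levels and classes; that part is omitted here.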

The table below compares YOLOv2's performance with other detection frameworks on the PASCAL VOC 2007 dataset:

| Detection Framework | Training data | mAP | FPS |
|---|---|---|---|
| Fast R-CNN | 2007+2012 | 70.0 | 0.5 |
| Faster R-CNN VGG-16 | 2007+2012 | 73.2 | 7 |
| Faster R-CNN ResNet | 2007+2012 | 76.4 | 5 |
| YOLO | 2007+2012 | 63.4 | 45 |
| SSD300 | 2007+2012 | 74.3 | 46 |
| SSD500 | 2007+2012 | 76.8 | 19 |
| YOLOv2 288 × 288 | 2007+2012 | 69.0 | 91 |
| YOLOv2 352 × 352 | 2007+2012 | 73.7 | 81 |
| YOLOv2 416 × 416 | 2007+2012 | 76.8 | 67 |
| YOLOv2 480 × 480 | 2007+2012 | 77.8 | 59 |
| YOLOv2 544 × 544 | 2007+2012 | 78.6 | 40 |

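All timing figures were measured on a GeForce GTX Titan X GPU (the original model, not the Pascal one). At 416 × 416 and above, YOLOv2 matches or exceeds the mAP of every other method in the table while still running at 40 FPS or more; at 288 × 288 it trades about one point of mAP against Fast R-CNN for a speed of 91 FPS.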

Because a single trained model can run at different resolutions, YOLOv2 keeps a high mAP while running in real time, which makes it suitable for applications such as real-time video processing and deployment on resource-constrained hardware.
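As a small illustration of this speed/accuracy trade-off, the sketch below picks the most accurate YOLOv2 input size that still meets a required frame rate. The numbers are the YOLOv2 rows of the VOC 2007 table; the function name `best_size_for_fps` is hypothetical, and the FPS values only hold for the Titan X used in the paper, so on other hardware they would need to be re-measured.

```python
# (input size, VOC 2007 mAP, FPS) for YOLOv2, copied from the comparison table.
YOLOV2_VOC2007 = [
    (288, 69.0, 91),
    (352, 73.7, 81),
    (416, 76.8, 67),
    (480, 77.8, 59),
    (544, 78.6, 40),
]


def best_size_for_fps(min_fps, results=YOLOV2_VOC2007):
    """Return the (size, mAP, FPS) row with the highest mAP that reaches min_fps.

    Returns None when no listed resolution is fast enough.
    """
    feasible = [row for row in results if row[2] >= min_fps]
    if not feasible:
        return None
    return max(feasible, key=lambda row: row[1])


if __name__ == "__main__":
    for target_fps in (30, 60, 90, 120):
        print(target_fps, best_size_for_fps(target_fps))
```

For example, a 60 FPS requirement selects the 416 × 416 setting, while a 30 FPS requirement allows the most accurate 544 × 544 setting.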

Discussion