【ML Paper】YOLOv2: part7

Published on 2024/11/08

This time, I'll introduce YOLOv2, based on the paper by Joseph Redmon and Ali Farhadi. Let's focus on the differences from YOLOv1.
This article is part 7. Part 6 is here.
Original Paper: https://arxiv.org/abs/1612.08242

Multi-Scale Training

YOLOv2 improves robustness to varying image sizes through multi-scale training. The original YOLO used a fixed input resolution of 448 × 448 pixels; with the addition of anchor boxes, YOLOv2 changes the input size to 416 × 416 pixels.

Because the model uses only convolutional and pooling layers, it can be resized on the fly during training. Every 10 batches, the network randomly selects a new image dimension from the multiples of 32 between 320 × 320 and 608 × 608 pixels. The sizes are multiples of 32 because the network downsamples its input by a factor of 32. This strategy forces the network to learn to predict accurately across different input sizes, so the same model can handle various resolutions.
At lower resolutions, such as 288 × 288 pixels, YOLOv2 achieves over 90 frames per second (FPS) with a mean Average Precision (mAP) comparable to Fast R-CNN, making it suitable for high-framerate applications or deployment on smaller GPUs.
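
The resampling schedule described above can be sketched in a few lines of Python. This is only an illustration of the schedule, not the authors' Darknet training code: the function names, the starting size of 416, and the use of Python's `random` module are assumptions made for this example. The stride of 32 is the network's downsampling factor, which also determines the size of the output grid.

```python
import random

STRIDE = 32           # total downsampling factor of the network
MIN_SIZE = 320        # smallest training resolution
MAX_SIZE = 608        # largest training resolution
RESAMPLE_EVERY = 10   # number of batches between size changes

# All candidate square input sizes: 320, 352, ..., 608
CANDIDATE_SIZES = list(range(MIN_SIZE, MAX_SIZE + 1, STRIDE))


def multiscale_schedule(num_batches, rng, start_size=416):
    """Yield (batch_index, input_size), drawing a new size every RESAMPLE_EVERY batches.

    start_size is an assumption for this sketch, not a value fixed by the paper.
    """
    size = start_size
    for batch_idx in range(num_batches):
        if batch_idx > 0 and batch_idx % RESAMPLE_EVERY == 0:
            size = rng.choice(CANDIDATE_SIZES)
        yield batch_idx, size


def grid_cells(input_size):
    """Side length of the output feature map for a square input."""
    return input_size // STRIDE


if __name__ == "__main__":
    rng = random.Random(0)
    for batch_idx, size in multiscale_schedule(30, rng):
        if batch_idx % RESAMPLE_EVERY == 0:
            cells = grid_cells(size)
            print(f"batch {batch_idx}: {size}x{size} input, {cells}x{cells} grid")
```

In a real training loop, each batch of images (and its bounding-box labels) would be resized to the sampled size before the forward pass; that part depends on the framework and is left out here.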

At higher resolutions, YOLOv2 reaches a state-of-the-art mAP of 78.6 on the PASCAL VOC 2007 dataset while maintaining real-time processing speeds. This flexibility allows users to balance speed and accuracy based on their specific needs.

Further Experiments

YOLOv2's performance was further tested on additional datasets. When trained on the PASCAL VOC 2012 dataset, YOLOv2 achieved a mAP of 73.4 while running far faster than competing detection systems. On the COCO dataset, YOLOv2 reached a mAP of 44.0 at an Intersection over Union (IOU) threshold of 0.5, comparable to SSD and Faster R-CNN.
The table below compares YOLOv2's performance with other detection frameworks on the PASCAL VOC 2007 dataset:


Detection Framework	Train	mAP	FPS
Fast R-CNN	2007+2012	70.0	0.5
Faster R-CNN VGG-16	2007+2012	73.2	7
Faster R-CNN ResNet	2007+2012	76.4	5
YOLO	2007+2012	63.4	45
SSD300	2007+2012	74.3	46
SSD500	2007+2012	76.8	19
YOLOv2 288 × 288	2007+2012	69.0	91
YOLOv2 352 × 352	2007+2012	73.7	81
YOLOv2 416 × 416	2007+2012	76.8	67
YOLOv2 480 × 480	2007+2012	77.8	59
YOLOv2 544 × 544	2007+2012	78.6	40

All speeds are measured on a GeForce GTX Titan X GPU (the original model, not the Pascal one). At 416 × 416, YOLOv2 matches the best mAP among the earlier detectors in the table (76.8, SSD500) at 67 FPS, faster than any of them, and the 480 × 480 and 544 × 544 settings push mAP higher still at 59 and 40 FPS.
Because the same trained model can run at different resolutions, YOLOv2 can trade accuracy for speed at inference time, making it versatile for applications such as real-time video processing and deployment on resource-constrained hardware.
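
To make the speed/accuracy trade-off concrete, here is a small sketch that picks the most accurate YOLOv2 resolution meeting a frame-rate budget. The numbers are the YOLOv2 rows of the VOC 2007 comparison; the helper `best_resolution` is a hypothetical name introduced for this example, and the FPS values only apply to the Titan X setup used in the paper.

```python
# (resolution, mAP, FPS) for YOLOv2, taken from the VOC 2007 comparison table.
# FPS values are specific to the GPU used in the paper, so treat them as illustrative.
YOLOV2_RESULTS = [
    (288, 69.0, 91),
    (352, 73.7, 81),
    (416, 76.8, 67),
    (480, 77.8, 59),
    (544, 78.6, 40),
]


def best_resolution(min_fps):
    """Return the (resolution, mAP, FPS) row with the highest mAP that meets min_fps.

    Returns None when no listed resolution is fast enough.
    """
    feasible = [row for row in YOLOV2_RESULTS if row[2] >= min_fps]
    if not feasible:
        return None
    return max(feasible, key=lambda row: row[1])


if __name__ == "__main__":
    print(best_resolution(60))   # (416, 76.8, 67)
    print(best_resolution(30))   # (544, 78.6, 40)
    print(best_resolution(100))  # None
```

On different hardware the FPS column would need to be measured again, but the selection logic stays the same.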
