【ML Paper】YOLO: Unified Real-Time Object Detection part9

Published on 2024/10/30

This time, I'll explain the YOLO object detection model by following the original paper.
This is part 9, and part 10 will be published soon.

Original paper: https://arxiv.org/abs/1506.02640

9. VOC 2012 Results

On the VOC 2012 test set, YOLO scores a mean Average Precision (mAP) of 57.9%. This is below the state of the art at the time of the paper and closer to the original R-CNN using VGG-16. YOLO struggles with small objects compared with its closest competitors: in categories like "bottle," "sheep," and "tv/monitor," it scores 8-10% lower than R-CNN or Feature Edit. In other categories, such as "cat" and "train," YOLO achieves higher performance than those methods.
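To make the metric concrete: mAP is the unweighted mean of the per-class AP values, so a model can lose ground on some classes and gain on others while the overall score moves only a little. The sketch below illustrates this with made-up AP numbers (they are not taken from the paper's table), and the helper names `mean_average_precision` and `per_class_gap` are hypothetical.

```python
from statistics import mean


def mean_average_precision(ap_by_class):
    """Return the unweighted mean of per-class AP values (in percent)."""
    return mean(ap_by_class.values())


def per_class_gap(ap_a, ap_b):
    """Return AP(a) - AP(b) for every class present in both results."""
    return {c: ap_a[c] - ap_b[c] for c in ap_a if c in ap_b}


# Illustrative values only; NOT the numbers reported in the paper.
detector_a = {"bottle": 20.0, "cat": 80.0, "train": 70.0}
detector_b = {"bottle": 30.0, "cat": 75.0, "train": 65.0}

map_a = mean_average_precision(detector_a)
map_b = mean_average_precision(detector_b)
gaps = per_class_gap(detector_a, detector_b)

print(f"mAP a: {map_a:.1f}, mAP b: {map_b:.1f}")
for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f}")
```

A negative gap marks a class where the first detector trails the second, which is the kind of per-category comparison made above for "bottle" versus "cat."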

The combined Fast R-CNN + YOLO model is among the highest-performing detection methods. Combining Fast R-CNN with YOLO improves its mAP by 2.3%, which moves it up five places on the public leaderboard.

[Figure: VOC 2012 leaderboard]

9.1 Generalizability: Person Detection in Artwork

Academic object detection datasets usually draw their training and testing data from the same distribution. In real-world applications, however, test data can diverge from what the model has seen during training, which makes performance on new data hard to predict. To evaluate how well YOLO generalizes, the paper compares it with other detection methods on the Picasso and People-Art datasets, two artwork datasets used for testing person detection.
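The comparison in this section is reported as AP on a single class (person). As a rough sketch of how such a score can be computed from ranked detections, the code below uses all-point interpolation over the precision-recall curve. This is an assumption made for illustration: the exact evaluation protocol used for each dataset is not described here, the function name `average_precision` is hypothetical, and each detection is assumed to be already marked as a true or false positive by an earlier box-matching step.

```python
def average_precision(detections, num_ground_truth):
    """All-point interpolated AP for one class.

    detections: list of (confidence, is_true_positive) pairs. The flag is
        assumed to come from a prior matching step against ground truth.
    num_ground_truth: number of annotated objects of this class.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")

    # Rank detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    precisions, recalls = [], []
    true_positives = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        true_positives += int(is_tp)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)

    # Replace each precision with the best precision at any later rank,
    # so the curve never increases as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the area under the stepwise precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Toy example: three detections, two annotated people in the dataset.
toy = [(0.9, True), (0.8, False), (0.7, True)]
ap_toy = average_precision(toy, num_ground_truth=2)
print(f"AP = {ap_toy:.3f}")
```

Running the same function on detections from a natural-image test set and from an artwork test set gives two AP values whose difference is the degradation discussed in this section.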

When models trained on VOC 2007 data are tested on artwork, R-CNN's accuracy drops sharply. R-CNN relies on Selective Search for bounding box proposals, and Selective Search is tuned for natural images. Because R-CNN depends on good proposals, it adapts poorly to the new domain.

In contrast, DPM (Deformable Part Model) generalizes better to artwork, which is attributed to its strong spatial models of object shape and layout. It degrades less than R-CNN, but it starts from a lower AP.

YOLO adapts well, maintaining higher AP on artwork than both R-CNN and DPM. YOLO models the size and shape of objects as well as their contextual relationships, and this carries over across visual domains. Artwork and natural images differ at the pixel level, but object size and shape stay similar, so YOLO can still predict good bounding boxes and detections.
