【ML Paper】YOLO: Unified Real-Time Object Detection part5
This time, I'll explain the YOLO object detection model by following the original paper.
This is part 5; part 6 will be published soon.
Original paper: https://arxiv.org/abs/1506.02640
5. Inference
5.1 Inference
During inference, predicting detections for a test image requires only a single evaluation of the network, just like during training. On the PASCAL VOC dataset, the network predicts 98 bounding boxes per image along with class probabilities for each box. This efficiency makes YOLO extremely fast at test time compared to classifier-based methods, as it avoids multiple network evaluations.
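The figure of 98 boxes follows from the grid settings the paper uses on PASCAL VOC: a 7x7 grid (S=7), 2 boxes per cell (B=2), and 20 classes (C=20). The snippet below is only a small arithmetic sketch; the function names are made up for illustration and are not from the paper.

```python
# PASCAL VOC settings reported in the YOLO paper: S=7 grid, B=2 boxes per cell, C=20 classes.
S, B, C = 7, 2, 20

def output_shape(s, b, c):
    """Shape of the prediction tensor.

    Each cell holds b boxes (x, y, w, h, confidence) plus c class probabilities.
    """
    return (s, s, b * 5 + c)

def num_boxes(s, b):
    """Total number of boxes predicted for one image."""
    return s * s * b

print(output_shape(S, B, C))  # (7, 7, 30)
print(num_boxes(S, B))        # 98
```

All 98 boxes come out of one forward pass, which is why no per-region re-evaluation of the network is needed.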
The grid-based design enforces spatial diversity in bounding box predictions. Usually, it's clear which grid cell an object falls into, and the network predicts one box per object. However, for large objects or objects near the borders of multiple cells, multiple cells may localize the object well, leading to multiple detections. Non-maximal suppression can be applied to eliminate these duplicate detections, improving mean Average Precision (mAP) by about 2-3%, though it's not as critical for performance as in methods like R-CNN or DPM.
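To make the duplicate-removal step concrete, here is a minimal sketch of greedy non-maximal suppression in plain Python. The paper does not give its threshold or implementation, so the `iou_threshold=0.5` default, the `(x1, y1, x2, y2)` box format, and the sample boxes are all assumptions for illustration. In practice this is usually run separately for each class.

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first.

    iou_threshold is an assumed value, not one taken from the paper.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any
        # higher-scoring box that was already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

# Hypothetical detections: the first two cover the same large object
# (as if predicted by neighboring cells), the third is a separate object.
boxes = [(10, 10, 110, 110), (15, 12, 112, 108), (200, 200, 260, 260)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box is dropped because it overlaps heavily with the higher-scoring first box, while the distant third box survives.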
5.2 Limitations of YOLO
YOLO imposes strong spatial constraints on bounding box predictions because each grid cell predicts only two boxes and can have only one class. This constraint limits the number of nearby objects the model can predict, so it struggles with small objects that appear in groups, such as flocks of birds.
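The constraint is easy to see by mapping object centers to grid cells. The sketch below assumes the paper's 448x448 input and 7x7 grid, so each cell covers 64x64 pixels; the `cell_index` helper and the bird coordinates are hypothetical.

```python
def cell_index(cx, cy, img_w, img_h, s=7):
    """Return (row, col) of the grid cell containing the center (cx, cy), in pixels."""
    col = min(int(cx / img_w * s), s - 1)
    row = min(int(cy / img_h * s), s - 1)
    return row, col

# Hypothetical centers of three small birds flying close together.
birds = [(100, 100), (120, 100), (110, 120)]
cells = [cell_index(cx, cy, 448, 448) for cx, cy in birds]
print(cells)  # [(1, 1), (1, 1), (1, 1)]
```

All three centers land in the same cell. Since that cell can output at most two boxes and a single set of class probabilities, at least one bird cannot be detected.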
Since the model learns to predict bounding boxes directly from data, it has difficulty generalizing to objects with new or unusual aspect ratios or configurations. Additionally, the model uses relatively coarse features for predicting bounding boxes due to multiple downsampling layers in the architecture, which affects localization precision.
Lastly, although the loss function approximates detection performance, it treats errors equally in small and large bounding boxes. A small localization error in a large box is generally benign, but the same error in a small box significantly affects the Intersection over Union (IOU) score. Consequently, the main source of error in YOLO is incorrect localization.
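A quick calculation shows how differently the same pixel error affects IOU for large and small boxes. This is only an illustrative sketch: the box sizes, the 10-pixel shift, and the helper names are assumptions, not values from the paper.

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def shifted_iou(size, shift):
    """IOU between a size x size box and the same box shifted right by `shift` pixels."""
    truth = (0, 0, size, size)
    pred = (shift, 0, size + shift, size)
    return iou(truth, pred)

print(round(shifted_iou(200, 10), 3))  # 0.905 (large box, 10 px error)
print(round(shifted_iou(20, 10), 3))   # 0.333 (small box, same 10 px error)
```

With an IOU threshold of 0.5, the large box would still count as a correct detection, while the small box with the identical offset would not.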