
【ML Paper】Explanation of the Entire YOLO Series, Part 5

Published 2024/11/21

This is a summary of a paper that explains YOLOv1 through v8 in one place.
Let's trace the history of YOLO with this paper.
This article is Part 5; Part 4 is here.

Original Paper: https://arxiv.org/pdf/2304.00501

4 YOLO: You Only Look Once

Real-Time End-to-End Approach

YOLO, introduced by Joseph Redmon et al. in CVPR 2016, marked the first real-time end-to-end method for object detection. Unlike previous techniques that relied on sliding windows with classifiers or multi-step processes involving region proposals followed by classification, YOLO streamlined detection into a single network pass. This approach simplified the output mechanism by using regression to predict detection outputs, contrasting with methods like Fast R-CNN, which employed separate classification and regression outputs.

4.1 How YOLOv1 Works

YOLOv1 integrates all object detection stages by predicting all bounding boxes simultaneously. The input image is divided into an S \times S grid; each grid cell predicts B bounding boxes with confidence scores, along with conditional probabilities for C classes. Each bounding box prediction includes five values: P_c, b_x, b_y, b_h, and b_w. Here, P_c is the confidence score indicating both the likelihood that the box contains an object and the accuracy of the box. The coordinates b_x and b_y denote the box's center relative to the grid cell, while b_h and b_w specify the box's height and width relative to the entire image. The final output is a tensor of dimensions S \times S \times (B \times 5 + C), typically followed by non-maximum suppression (NMS) to eliminate duplicate detections.
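The NMS step mentioned above can be sketched as follows. This is a minimal, generic greedy NMS, not code from the paper: it keeps the highest-scoring box and discards any remaining box whose IoU with it exceeds a threshold (the threshold value here is an illustrative assumption).

```python
import numpy as np

def iou(box_a, box_b):
    # Boxes given as (x1, y1, x2, y2) corner coordinates.
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # Greedy NMS: repeatedly keep the best-scoring box and drop
    # all remaining boxes that overlap it too strongly.
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_threshold]
    return keep
```

For example, two heavily overlapping detections of the same object collapse to the higher-scoring one, while a distant box survives:

```python
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
nms(boxes, scores)  # keeps boxes 0 and 2; box 1 overlaps box 0 too much
```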

In the original implementation, YOLOv1 was trained on the PASCAL VOC dataset, which includes 20 classes (C = 20). The model employed a grid size of 7 \times 7 (S = 7) with up to 2 bounding boxes per grid cell (B = 2), resulting in an output prediction of size 7 \times 7 \times 30. For illustrative purposes, a simplified setup with a 3 \times 3 grid, three classes, and a single bounding box per grid cell would produce an output of 3 \times 3 \times 8.

Performance on PASCAL VOC2007

YOLOv1 achieved an average precision (AP) of 63.4 on the PASCAL VOC2007 dataset, demonstrating its effectiveness in real-time object detection tasks.


Discussion