🐶

【ML Paper】Explanation of the Entire YOLO Series, Part 5

Published 2024/11/23

This is a summary of a paper that explains YOLOv1 through v8 in one place.
Let's look at the history of YOLO through this paper.
This article is part 5; part 4 is here.

Original Paper: https://arxiv.org/pdf/2304.00501

4 YOLO: You Only Look Once

Real-Time End-to-End Approach

YOLO, introduced by Joseph Redmon et al. in CVPR 2016, was the first to present a real-time end-to-end approach for object detection. The acronym YOLO stands for "You Only Look Once," highlighting its capability to perform object detection in a single network pass. This contrasts with previous methods that relied on sliding windows with classifiers or multi-step processes involving region proposals followed by classification. Additionally, YOLO employs a straightforward regression-based output to predict detection results, differing from methods like Fast R-CNN that separate classification and bounding box regression into distinct outputs.

4.1 How does YOLOv1 work?

Unified Detection Process

YOLOv1 streamlines object detection by predicting all bounding boxes simultaneously. It divides the input image into an S \times S grid, where each grid cell predicts B bounding boxes along with confidence scores for C different classes. Each bounding box prediction includes five values: P_c, b_x, b_y, b_h, and b_w. Here, P_c represents the confidence score indicating the likelihood of an object and the accuracy of the bounding box. The coordinates b_x and b_y denote the box center relative to the grid cell, while b_h and b_w represent the height and width relative to the image.
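To make the coordinate encoding above concrete, here is a minimal sketch of decoding one cell's box prediction into pixel coordinates. The function name and signature are illustrative, not from the paper's implementation; it only assumes the convention described in the text (b_x, b_y relative to the grid cell, b_h, b_w relative to the image):

```python
def decode_box(pred, row, col, S, img_w, img_h):
    """Decode one YOLOv1 box prediction for grid cell (row, col).

    pred = (P_c, b_x, b_y, b_h, b_w), where b_x, b_y are the box center's
    offset within the cell and b_h, b_w are relative to the whole image.
    Returns (confidence, center_x, center_y, width, height) in pixels.
    """
    p_c, b_x, b_y, b_h, b_w = pred
    cell_w = img_w / S
    cell_h = img_h / S
    cx = (col + b_x) * cell_w  # cell index + within-cell offset, scaled to pixels
    cy = (row + b_y) * cell_h
    return p_c, cx, cy, b_w * img_w, b_h * img_h


# A box centered in cell (3, 3) of a 7x7 grid on a 448x448 image
# lands at the image center (224, 224).
print(decode_box((0.9, 0.5, 0.5, 0.5, 0.5), 3, 3, 7, 448, 448))
```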

Output Structure

The output of YOLO is a tensor of dimensions S \times S \times (B \times 5 + C), which can be followed by non-maximum suppression (NMS) to eliminate duplicate detections. In the original YOLO implementation, the PASCAL VOC dataset was used, consisting of 20 classes (C = 20), a grid size of 7 \times 7 (S = 7), and up to 2 bounding boxes per grid cell (B = 2), resulting in a 7 \times 7 \times 30 output prediction. A simplified example with a 3 \times 3 grid, three classes, and a single bounding box per grid cell would produce a 3 \times 3 \times 8 output.
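The shape arithmetic above can be checked with a few lines of NumPy (used here purely for illustration; the original implementation was in Darknet):

```python
import numpy as np

# PASCAL VOC configuration from the paper: 7x7 grid, 2 boxes/cell, 20 classes.
S, B, C = 7, 2, 20
output = np.zeros((S, S, B * 5 + C))  # each box contributes 5 values
print(output.shape)  # (7, 7, 30)

# Simplified example from the text: 3x3 grid, 1 box/cell, 3 classes.
S2, B2, C2 = 3, 1, 3
print((S2, S2, B2 * 5 + C2))  # (3, 3, 8)
```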

Performance

YOLOv1 achieved an average precision (AP) of 63.4% on the PASCAL VOC 2007 dataset, demonstrating its effectiveness as a real-time object detection system.

Discussion