🐶

【ML Paper】Explanation of the Entire YOLO Series, Part 5

Published 2024/11/23

This is a summary of a paper that explains YOLOv1 through v8 in one place.
Let's look at the history of YOLO through this paper.
This article is part 5; part 4 is here.

Original Paper: https://arxiv.org/pdf/2304.00501

4 YOLO: You Only Look Once

Real-Time End-to-End Approach

YOLO, introduced by Joseph Redmon et al. in CVPR 2016, was the first to present a real-time end-to-end approach for object detection. The acronym YOLO stands for "You Only Look Once," highlighting its capability to perform object detection in a single network pass. This contrasts with previous methods that relied on sliding windows with classifiers or multi-step processes involving region proposals followed by classification. Additionally, YOLO employs a straightforward regression-based output to predict detection results, differing from methods like Fast R-CNN that separate classification and bounding box regression into distinct outputs.

4.1 How does YOLOv1 work?

Unified Detection Process

YOLOv1 streamlines object detection by predicting all bounding boxes simultaneously. It divides the input image into an S \times S grid, where each grid cell predicts B bounding boxes along with confidence scores for C different classes. Each bounding box prediction includes five values: P_c, b_x, b_y, b_h, and b_w. Here, P_c represents the confidence score indicating the likelihood of an object and the accuracy of the bounding box. The coordinates b_x and b_y denote the box center relative to the grid cell, while b_h and b_w represent the height and width relative to the image.
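To make the coordinate encoding above concrete, here is a minimal sketch of decoding one cell's box prediction into pixel coordinates. The function name and signature are illustrative, not from the paper's implementation; it only assumes the convention described in the text (b_x, b_y relative to the grid cell, b_h, b_w relative to the image):

```python
def decode_box(pred, row, col, S, img_w, img_h):
    """Decode one YOLOv1 box prediction for grid cell (row, col).

    pred = (P_c, b_x, b_y, b_h, b_w), where b_x, b_y are the box center's
    offset within the cell and b_h, b_w are relative to the whole image.
    Returns (confidence, center_x, center_y, width, height) in pixels.
    """
    p_c, b_x, b_y, b_h, b_w = pred
    cell_w = img_w / S
    cell_h = img_h / S
    cx = (col + b_x) * cell_w  # cell index + within-cell offset, scaled to pixels
    cy = (row + b_y) * cell_h
    return p_c, cx, cy, b_w * img_w, b_h * img_h


# A box centered in cell (3, 3) of a 7x7 grid on a 448x448 image
# lands at the image center (224, 224).
print(decode_box((0.9, 0.5, 0.5, 0.5, 0.5), 3, 3, 7, 448, 448))
```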

Output Structure

The output of YOLO is a tensor of dimensions S \times S \times (B \times 5 + C), which can be followed by non-maximum suppression (NMS) to eliminate duplicate detections. In the original YOLO implementation, the PASCAL VOC dataset was used, consisting of 20 classes (C = 20), a grid size of 7 \times 7 (S = 7), and up to 2 bounding boxes per grid cell (B = 2), resulting in a 7 \times 7 \times 30 output prediction. A simplified example with a 3 \times 3 grid, three classes, and a single bounding box per grid cell would produce a 3 \times 3 \times 8 output.
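The shape arithmetic above can be checked with a few lines of NumPy (used here purely for illustration; the original implementation was in Darknet):

```python
import numpy as np

# PASCAL VOC configuration from the paper: 7x7 grid, 2 boxes/cell, 20 classes.
S, B, C = 7, 2, 20
output = np.zeros((S, S, B * 5 + C))  # each box contributes 5 values
print(output.shape)  # (7, 7, 30)

# Simplified example from the text: 3x3 grid, 1 box/cell, 3 classes.
S2, B2, C2 = 3, 1, 3
print((S2, S2, B2 * 5 + C2))  # (3, 3, 8)
```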

Performance

YOLOv1 achieved an average precision (AP) of 63.4% on the PASCAL VOC 2007 dataset, demonstrating its effectiveness as a real-time object detection system.

Discussion