
【ML Paper】YOLO: Unified Real-Time Object Detection part2

Published on 2024/10/21

In this post, I explain the YOLO object detection model by walking through the original paper.
This is part 2; part 3 will be published soon.

Original paper: https://arxiv.org/abs/1506.02640

3. Unified detection

3.1 Introduction to Unified Object Detection

YOLO unifies the separate components of object detection into a single neural network. The network uses features from the entire image to predict each bounding box, and it predicts all bounding boxes across all classes simultaneously. This lets the model reason globally about the full image and every object in it, and it enables end-to-end training and real-time speed while maintaining high average precision.

3.2 Image Grid Division and Responsibility for Detection

The input image is divided into an S × S grid. If the center of an object falls into a grid cell, that cell is responsible for detecting the object. Each grid cell predicts B bounding boxes and a confidence score for each box. The confidence reflects both how likely the box is to contain an object and how accurate the model thinks the box is.
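
To make the "responsible cell" rule concrete, here is a minimal sketch in Python. The function name `responsible_cell` and the 448 × 448 image size in the example are my own choices for illustration, not something defined in this section.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the object center.

    cx, cy are the object center in pixels; img_w, img_h are the image size.
    """
    # Scale the center into grid units, then truncate to get the cell index.
    # The min() clamp keeps a center lying exactly on the right/bottom edge
    # inside the last cell.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


# Hypothetical example: an object centered at (224, 100) in a 448 x 448 image.
print(responsible_cell(224, 100, 448, 448))  # (1, 3)
```

Only this one cell is responsible for the object, even if the object itself spans many cells.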

3.3 Bounding Box Predictions

Each bounding box prediction contains five values: x, y, w, h, and a confidence score.

  1. x and y give the center of the box, relative to the bounds of the grid cell,
  2. w (width) and h (height) are given relative to the entire image,
  3. The confidence score is the product of the probability that an object is present and the Intersection over Union (IOU) between the predicted box and the ground-truth box, i.e. Pr(Object) × IOU. If the cell contains no object, the confidence should be zero.
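
The five values can be sketched in code as follows. This is only an illustration under my own assumptions: the helper names `encode_box` and `iou`, and the corner format `(x_min, y_min, x_max, y_max)` used for the IOU, are hypothetical choices and not taken from the paper.

```python
def encode_box(cx, cy, bw, bh, img_w, img_h, S=7):
    """Convert a pixel-space box (center cx, cy and size bw, bh) into
    (row, col, x, y, w, h) following the description above."""
    gx = cx / img_w * S          # center in grid units
    gy = cy / img_h * S
    col = min(int(gx), S - 1)    # responsible cell
    row = min(int(gy), S - 1)
    x = gx - col                 # center offset inside the cell, in [0, 1]
    y = gy - row
    w = bw / img_w               # size relative to the whole image
    h = bh / img_h
    return row, col, x, y, w, h


def iou(a, b):
    """Intersection over Union of two boxes in (x_min, y_min, x_max, y_max) form."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


row, col, x, y, w, h = encode_box(224, 100, 112, 224, 448, 448)
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # boxes share a 1 x 1 region
```

For a cell that does contain an object, Pr(Object) is 1, so the confidence target reduces to the IOU value computed by a function like `iou`.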

3.4 Class Probability Predictions

Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object), that is, class probabilities conditioned on the cell containing an object. Regardless of the number of boxes B, each grid cell predicts only one set of class probabilities.

3.5 Final Prediction

At test time, the conditional class probabilities are multiplied by the confidence of each individual box, which gives a class-specific confidence score for every box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
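
As a small sketch of this multiplication for a single grid cell (the function name `class_specific_scores` and the example numbers are made up for illustration):

```python
def class_specific_scores(class_probs, box_confidences):
    """For one grid cell, return scores[b][c] =
    (conditional probability of class c) * (confidence of box b)."""
    return [[conf * p for p in class_probs] for conf in box_confidences]


# Hypothetical cell with C = 3 classes and B = 2 boxes.
class_probs = [0.7, 0.2, 0.1]   # one shared set of class probabilities per cell
box_confidences = [0.9, 0.3]    # one confidence per predicted box
scores = class_specific_scores(class_probs, box_confidences)

for b, row in enumerate(scores):
    print(f"box {b}:", [round(s, 3) for s in row])
```

Because the class probabilities are shared by the whole cell, the two boxes here differ only through their confidence values.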

3.6 Model Structure and Tensor Representation

The model treats object detection as a regression problem. The image is divided into an S × S grid, and each cell predicts B bounding boxes, a confidence score for each box, and C class probabilities. These predictions are encoded as an S × S × (B × 5 + C) tensor.

3.7 Example Application (PASCAL VOC)

For evaluation on the PASCAL VOC dataset, the parameters are set to S = 7, B = 2, and C = 20, because PASCAL VOC has 20 labeled classes. The final prediction is therefore a 7 × 7 × 30 tensor (2 × 5 + 20 = 30 values per cell), encoding all the predicted bounding boxes, confidence scores, and class probabilities for the image.
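
The shape arithmetic can be checked with a short sketch. Note that the ordering of values inside each 30-element cell vector is an assumption on my side (B groups of x, y, w, h, confidence, followed by C class probabilities); this section does not specify the layout, and `split_cell` is a hypothetical helper.

```python
S, B, C = 7, 2, 20
depth = B * 5 + C  # 2 * 5 + 20 = 30 values per grid cell


def split_cell(vec, B=2, C=20):
    """Split one cell's prediction vector into boxes and class probabilities.

    ASSUMED layout: B consecutive (x, y, w, h, confidence) groups,
    then C class probabilities.
    """
    if len(vec) != B * 5 + C:
        raise ValueError("unexpected vector length")
    boxes = [tuple(vec[i * 5:(i + 1) * 5]) for i in range(B)]
    class_probs = list(vec[B * 5:])
    return boxes, class_probs


# Dummy all-zero prediction with shape S x S x depth, as nested lists.
prediction = [[[0.0] * depth for _ in range(S)] for _ in range(S)]
boxes, class_probs = split_cell(prediction[3][4])
print(depth, len(boxes), len(class_probs))  # 30 2 20
```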
