【ML Paper】Explanation of all of YOLO series Part 7

Published on 2024/11/23

This is a summary of a paper that explains YOLOv1 through YOLOv8 in a single work.
Let's trace the history of YOLO through this paper.
This article is Part 7; Part 6 is here.

Original Paper: https://arxiv.org/pdf/2304.00501

4.3 YOLOv1 Training

Pre-training and Fine-tuning

The authors initially pre-trained the first 20 layers of YOLOv1 at a resolution of $224 \times 224$ using the ImageNet dataset. Following this, they appended the last four layers with randomly initialized weights and fine-tuned the entire model on the PASCAL VOC 2007 and VOC 2012 datasets at an increased resolution of $448 \times 448$. This higher resolution aimed to enhance the model's ability to detect objects with greater detail and accuracy.
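
As a rough PyTorch-style sketch of this two-stage setup (the layer shapes, channel counts, and class names below are illustrative placeholders, not the authors' Darknet configuration, and the checkpoint path is hypothetical):

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Stand-in for the first 20 convolutional layers of YOLOv1, pre-trained on ImageNet at 224x224."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.LeakyReLU(0.1),
            nn.MaxPool2d(2, 2),
            # ... the remaining convolutional layers are omitted for brevity ...
        )

    def forward(self, x):
        return self.features(x)

class YOLOv1(nn.Module):
    def __init__(self, backbone, S=7, B=2, C=20):
        super().__init__()
        self.backbone = backbone                       # weights come from the ImageNet stage
        self.detector = nn.Sequential(                 # appended layers, randomly initialized
            nn.Conv2d(64, 1024, kernel_size=3, padding=1), nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d(S),
            nn.Flatten(),
            nn.Linear(1024 * S * S, S * S * (B * 5 + C)),
        )

    def forward(self, x):                              # x: images at the 448x448 fine-tuning resolution
        return self.detector(self.backbone(x))

backbone = Backbone()
# backbone.load_state_dict(torch.load("imagenet_pretrained.pt"))  # hypothetical checkpoint
model = YOLOv1(backbone)
out = model(torch.randn(1, 3, 448, 448))
print(out.shape)  # torch.Size([1, 1470]) == 7 * 7 * (2 * 5 + 20)
```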

Data Augmentation

To improve the model's robustness and generalization, several data augmentation techniques were employed. These included random scaling and translations of up to 20% of the input image size. Additionally, the authors applied random exposure and saturation adjustments with an upper-end factor of 1.5 in the HSV color space. These augmentations helped the model become invariant to variations in object size, position, and lighting conditions.
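
As an illustration only (not the authors' implementation; the exact sampling ranges and the scaling origin are assumptions), these augmentations could be sketched with NumPy and OpenCV like this:

```python
import numpy as np
import cv2  # OpenCV, used here for the affine warp and the HSV conversion

def augment(image, rng=None):
    """Illustrative YOLOv1-style augmentation: random scale/translation of up to 20%
    and random saturation/exposure adjustment of up to a factor of 1.5 in HSV space."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]

    # Random scaling and translation of up to 20% of the image size
    # (the sampling ranges and scaling origin here are assumptions).
    scale = rng.uniform(0.8, 1.2)
    tx = rng.uniform(-0.2, 0.2) * w
    ty = rng.uniform(-0.2, 0.2) * h
    M = np.float32([[scale, 0.0, tx],
                    [0.0, scale, ty]])
    image = cv2.warpAffine(image, M, (w, h))

    # Random saturation and exposure (value), each scaled by a factor of at most 1.5.
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] *= rng.uniform(1 / 1.5, 1.5)  # saturation
    hsv[..., 2] *= rng.uniform(1 / 1.5, 1.5)  # exposure / value
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

# Usage: augmented = augment(cv2.imread("example.jpg"))
# Note: in a real detection pipeline, the same geometric transform must be applied to the boxes.
```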

Loss Function

YOLOv1 utilized a composite loss function composed of multiple sum-squared errors, designed to optimize different aspects of object detection (the full expression is reproduced after the list):

  • Localization Loss: The first two terms of the loss function account for the errors in the predicted bounding box locations $(x, y)$ and sizes $(w, h)$. These errors are calculated only for boxes that contain objects, as indicated by the presence of an object in the corresponding grid cell ($\mathbb{1}_{ij}^{\text{obj}}$). A scale factor of $\lambda_{\text{coord}} = 5$ is applied to emphasize the importance of accurate bounding box predictions.

  • Confidence Loss: The third and fourth terms measure the confidence scores of the bounding boxes. The third term evaluates the confidence error for boxes that contain objects ($\mathbb{1}_{ij}^{\text{obj}}$), while the fourth term assesses the confidence error for boxes that do not contain objects ($\mathbb{1}_{ij}^{\text{noobj}}$). To account for the majority of boxes being empty, the confidence loss for non-object boxes is scaled down by a factor of $\lambda_{\text{noobj}} = 0.5$.

  • Classification Loss: The final component of the loss function measures the squared error of the class conditional probabilities for each class, but only for grid cells that contain objects ($\mathbb{1}_{i}^{\text{obj}}$). This ensures that the model focuses on accurately classifying objects where they are present.
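
For reference, these three groups of terms come from the full YOLOv1 loss, shown below as given in the original YOLOv1 paper, with $S^2$ grid cells, $B$ predicted boxes per cell, box confidence $C_i$, and conditional class probabilities $p_i(c)$; note that the width and height errors are computed on square roots, which softens the penalty on large boxes:

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \,\in\, \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$

with $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$ as described above.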

By integrating these loss components, YOLOv1 effectively balances the need for precise localization, confident detection, and accurate classification, while mitigating the impact of numerous empty bounding boxes during training.

Discussion