
【ML Paper】YOLO: Unified Real-Time Object Detection part4


This time, I'll explain the YOLO object detection model by walking through the paper.
This is part 4; part 5 will be published soon.

Original paper: https://arxiv.org/abs/1506.02640

4. Training

4.1 Pretraining the Convolutional Layers

We pretrained the first 20 convolutional layers of our model on the ImageNet 1000-class dataset. After these layers, we added an average-pooling layer and a fully connected layer. Using the Darknet framework, this network achieved a single-crop top-5 accuracy of 88% on the ImageNet 2012 validation set.

4.2 Adapting the Model for Detection

To adapt the model for object detection, we added four convolutional layers and two fully connected layers with randomly initialized weights. Because detection often requires fine-grained visual information, we increased the input resolution from $224 \times 224$ to $448 \times 448$ pixels. The final layer of the network predicts both class probabilities and bounding box coordinates. We normalized the bounding box width and height by the image width and height so that they fall between 0 and 1, and parametrized the box center $x$ and $y$ as offsets within a grid cell, so they are also bounded between 0 and 1. The final layer uses a linear activation function, while all other layers use the leaky ReLU activation function defined as:

$$
\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases}
$$
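As a minimal sketch of the activation and the coordinate normalization described above: the function names and the center-format pixel box are assumptions made for illustration, and the grid size `S = 7` is the value the paper uses for PASCAL VOC rather than something stated in this section.

```python
def leaky_relu(x: float, slope: float = 0.1) -> float:
    """Leaky ReLU with a negative slope of 0.1."""
    return x if x > 0 else slope * x


def normalize_box(cx, cy, w, h, img_w, img_h, S=7):
    """Convert a pixel-space box (center x/y, width, height) to grid targets.

    Returns (row, col, x_offset, y_offset, w_norm, h_norm); the last four
    values all lie between 0 and 1. Hypothetical helper for illustration.
    """
    # Grid cell that contains the box center (clamped to the last cell).
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    # Center expressed as an offset inside that cell.
    x_off = cx / img_w * S - col
    y_off = cy / img_h * S - row
    # Width and height relative to the whole image.
    return row, col, x_off, y_off, w / img_w, h / img_h
```

For example, a box centered in a $448 \times 448$ image falls in the middle cell of a $7 \times 7$ grid, with both offsets equal to 0.5.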

4.3 Loss Function Design

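We employed a sum-squared error loss function because it is easy to optimize. However, it weights localization error equally with classification error, and in every image many grid cells contain no object. This pushes the confidence scores of those cells toward zero and can overpower the gradient from the cells that do contain objects, which makes training unstable. To remedy this, we increased the loss for bounding box coordinate predictions by setting $\lambda_{\text{coord}} = 5$, and decreased the loss for confidence predictions on boxes without objects by setting $\lambda_{\text{noobj}} = 0.5$. Sum-squared error also weights errors in large and small boxes equally, although a small deviation matters more in a small box than in a large one. To partially address this, we predicted the square roots of the bounding box width and height instead of the raw values. Because each grid cell predicts multiple bounding boxes, during training we made one predictor responsible for each object: the one whose prediction has the highest current Intersection over Union (IOU) with the ground truth.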
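The square-root parametrization and the IOU-based choice of the responsible predictor can be illustrated with a small Python sketch. The function names and the corner-format boxes `(x_min, y_min, x_max, y_max)` are assumptions made for this example, not part of the paper.

```python
import math


def iou(a, b):
    """IOU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def responsible_predictor(pred_boxes, truth):
    """Index of the predicted box with the highest IOU against the truth."""
    return max(range(len(pred_boxes)), key=lambda j: iou(pred_boxes[j], truth))


def size_error(true_size, pred_size, use_sqrt):
    """Squared error on a width or height, with or without the square root."""
    if use_sqrt:
        return (math.sqrt(true_size) - math.sqrt(pred_size)) ** 2
    return (true_size - pred_size) ** 2


# The same absolute deviation (0.05) on a small and on a large box.
small = size_error(0.10, 0.15, use_sqrt=True)
large = size_error(0.80, 0.85, use_sqrt=True)
print(small > large)  # the small box is penalized more under the square root
```

Without the square root, both deviations would give essentially the same error, which is the behavior the paper wants to avoid.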

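The multi-part loss function used during training is defined as:

$$
\begin{aligned}
& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \sum_{i=0}^{S^2} \mathbf{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$

Here $S^2$ is the number of grid cells, $B$ is the number of bounding boxes predicted per cell, $C_i$ is the confidence score, and $p_i(c)$ is the conditional probability of class $c$ in cell $i$.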

In this loss function, $\mathbf{1}_{i}^{\text{obj}}$ denotes whether an object appears in cell $i$, and $\mathbf{1}_{ij}^{\text{obj}}$ denotes that the $j$-th bounding box predictor in cell $i$ is responsible for that prediction. The loss therefore penalizes classification error only if an object is present in the grid cell, and penalizes bounding box coordinate error only for the responsible predictor.
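To make the role of the indicators concrete, here is a rough sketch of the loss for a single image in plain Python. The data layout (a list of per-cell dictionaries), the function name, and passing the target confidence in as `"conf"` are all assumptions made for illustration. Widths and heights are assumed to be non-negative so that the square root is defined.

```python
import math

LAMBDA_COORD = 5.0
LAMBDA_NOOBJ = 0.5


def yolo_loss(preds, targets):
    """Sum-squared detection loss for one image (illustrative sketch).

    preds:   one dict per grid cell:
             {"boxes": [(x, y, w, h, conf), ...], "probs": [p_0, p_1, ...]}
    targets: one entry per grid cell, None if the cell holds no object, else
             {"box": (x, y, w, h), "probs": [...], "responsible": j,
              "conf": target confidence for predictor j}
    """
    loss = 0.0
    for pred, tgt in zip(preds, targets):
        for j, (x, y, w, h, conf) in enumerate(pred["boxes"]):
            if tgt is not None and j == tgt["responsible"]:
                # Responsible predictor: coordinate, size, and confidence terms.
                tx, ty, tw, th = tgt["box"]
                loss += LAMBDA_COORD * ((x - tx) ** 2 + (y - ty) ** 2)
                loss += LAMBDA_COORD * ((math.sqrt(w) - math.sqrt(tw)) ** 2
                                        + (math.sqrt(h) - math.sqrt(th)) ** 2)
                loss += (conf - tgt["conf"]) ** 2
            else:
                # Every other predictor: down-weighted confidence toward 0.
                loss += LAMBDA_NOOBJ * conf ** 2
        if tgt is not None:
            # Classification error only for cells that contain an object.
            loss += sum((p - t) ** 2
                        for p, t in zip(pred["probs"], tgt["probs"]))
    return loss
```

Note that a cell without an object contributes only through the $\lambda_{\text{noobj}}$ confidence term, which is why that weight controls how strongly empty cells pull confidences toward zero.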

4.4 Training Details

We trained the model for approximately 135 epochs on the training and validation sets of PASCAL VOC 2007 and 2012. When testing on VOC 2012, we also included the VOC 2007 test data in the training set. Throughout training we used a batch size of 64, a momentum of 0.9, and a weight decay of 0.0005. For the learning rate schedule, we slowly raised the learning rate from $10^{-3}$ to $10^{-2}$ over the first epochs, because starting at a high learning rate often makes the model diverge due to unstable gradients. We then kept the learning rate at $10^{-2}$ for 75 epochs, reduced it to $10^{-3}$ for the next 30 epochs, and finally reduced it to $10^{-4}$ for the last 30 epochs.
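The schedule can be written as a small function of the epoch number. This is only a sketch: the paper does not state how long the warm-up lasts or its exact shape, so the 5-epoch linear ramp here is an assumption, as is counting the warm-up inside the first 75 epochs so that the total comes to 135.

```python
def learning_rate(epoch: float, warmup_epochs: float = 5.0) -> float:
    """Learning rate at a given epoch (warm-up length and shape are assumed)."""
    if epoch < warmup_epochs:
        # Assumed linear ramp from 1e-3 up to 1e-2.
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < 75:
        return 1e-2   # main phase
    if epoch < 105:
        return 1e-3   # next 30 epochs
    return 1e-4       # final 30 epochs


for e in (0, 10, 80, 120):
    print(e, learning_rate(e))
```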

4.5 Data Augmentation and Regularization

To avoid overfitting, we used dropout and extensive data augmentation. A dropout layer with a rate of 0.5 after the first fully connected layer prevents co-adaptation between layers. For data augmentation, we applied random scaling and translations of up to 20% of the original image size, and randomly adjusted the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.
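A minimal sketch of how such augmentation parameters might be sampled, using only the Python standard library. The sampling distributions (uniform, with the color factor inverted half of the time) and all function names are assumptions; the paper only gives the 20% and 1.5 limits. The color jitter is shown on a single RGB pixel with values between 0 and 1.

```python
import colorsys
import random


def random_factor(max_factor: float, rng: random.Random) -> float:
    """Sample a factor between 1/max_factor and max_factor (assumed scheme)."""
    f = rng.uniform(1.0, max_factor)
    return f if rng.random() < 0.5 else 1.0 / f


def jitter_pixel(rgb, sat_factor, exp_factor):
    """Scale the saturation and value (exposure) of one RGB pixel in HSV."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    s = min(1.0, s * sat_factor)
    v = min(1.0, v * exp_factor)
    return colorsys.hsv_to_rgb(h, s, v)


def random_geometry(rng: random.Random, limit: float = 0.2):
    """Sample a scale and a translation, each within 20% of the image size."""
    scale = rng.uniform(1.0 - limit, 1.0 + limit)
    dx = rng.uniform(-limit, limit)  # horizontal shift, fraction of width
    dy = rng.uniform(-limit, limit)  # vertical shift, fraction of height
    return scale, dx, dy


rng = random.Random(0)
sat = random_factor(1.5, rng)
exp = random_factor(1.5, rng)
print(jitter_pixel((0.5, 0.25, 0.25), sat, exp), random_geometry(rng))
```

In a real pipeline the same geometric transform would also have to be applied to the ground-truth boxes, which this sketch leaves out.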
