
【ML Paper】YOLOv2: part6

Published on 2024/11/08

This time, I'll introduce YOLOv2, following the paper by Joseph Redmon and Ali Farhadi. Let's focus on the differences from YOLOv1.

This article is part 6. Part 5 is here.

Original Paper: https://arxiv.org/abs/1612.08242

Direct Location Prediction

In the context of using anchor boxes with YOLO, a significant challenge arises from model instability, particularly during the initial training iterations.
This instability primarily stems from the prediction of the (x, y) coordinates for bounding boxes. Traditional region proposal networks calculate the center coordinates as:

x = (t_x \times w_a) + x_a \\ y = (t_y \times h_a) + y_a

Here, t_x and t_y represent the predicted offsets, while w_a and h_a denote the width and height of the anchor box, and (x_a, y_a) are the anchor box's center coordinates.
For instance, a prediction of t_x = 1 shifts the box rightward by the anchor box's width, whereas t_x = -1 shifts it leftward by the same amount. This approach allows anchor boxes to be positioned anywhere within the image, leading to unrestricted variability that hampers the model's ability to stabilize quickly, especially with random initialization.

To mitigate this issue, the approach adopted by YOLO involves predicting location coordinates relative to the grid cell's position.
Locations relative to the grid cell bound the ground truth between 0 and 1, and a logistic activation constrains the network's predictions to the same range. Specifically, the network predicts five bounding boxes per cell in the output feature map, each characterized by five parameters: t_x, t_y, t_w, t_h, and t_o.
Given a cell offset (c_x, c_y) from the image's top-left corner and an anchor box with dimensions p_w and p_h, the bounding box parameters are computed as:

b_x = \sigma(t_x) + c_x \\ b_y = \sigma(t_y) + c_y \\ b_w = p_w e^{t_w} \\ b_h = p_h e^{t_h} \\ P_{\text{object}} \times \text{IOU}(b, \text{object}) = \sigma(t_o)

This constrained parameterization simplifies the learning process, enhancing the network's stability.
Additionally, leveraging dimension clusters and directly predicting the bounding box center locations results in an approximate 5% performance improvement over the traditional anchor box approach.
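As a rough illustration of these formulas, decoding one box under the constrained parameterization might look like the following NumPy sketch (my own naming, not the original Darknet code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_yolo_box(t, cell, prior):
    """Constrained YOLOv2-style decoding: sigma() pins the box center
    inside the predicting cell, so it cannot drift across the grid."""
    t_x, t_y, t_w, t_h, t_o = t
    c_x, c_y = cell           # cell offset from the image's top-left corner
    p_w, p_h = prior          # prior (anchor) width/height from dimension clusters
    b_x = sigmoid(t_x) + c_x  # always in (c_x, c_x + 1)
    b_y = sigmoid(t_y) + c_y  # always in (c_y, c_y + 1)
    b_w = p_w * np.exp(t_w)
    b_h = p_h * np.exp(t_h)
    confidence = sigmoid(t_o)  # estimate of P(object) * IOU(b, object)
    return b_x, b_y, b_w, b_h, confidence

# Raw predictions for one box in the cell at grid offset (6, 4):
print(decode_yolo_box((0.2, -0.5, 0.1, 0.3, 1.5), cell=(6, 4), prior=(1.8, 3.2)))
```

However far t_x and t_y swing, the sigmoid keeps the center within its own cell, which is the source of the stability gain.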

Fine-Grained Features

The modified YOLO architecture performs detections on a 13 \times 13 feature map, which suffices for identifying large objects.
However, accurately localizing smaller objects benefits from higher-resolution features.
Unlike Faster R-CNN and SSD, which utilize multiple feature maps at varying resolutions for their proposal networks, the enhanced YOLO incorporates a passthrough layer to integrate finer-grained features.

This passthrough layer extracts features from an earlier layer with a 26 \times 26 resolution and concatenates them with the existing low-resolution features.
Instead of merging spatial locations, adjacent features are stacked into different channels, analogous to the identity mappings in ResNet. Consequently, the 26 \times 26 \times 512 feature map is transformed into a 13 \times 13 \times 2048 feature map, which is then combined with the original features.
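The stacking is a space-to-depth rearrangement. Here is a minimal NumPy sketch of that reorganization (my own implementation for a channels-last layout, not the original Darknet reorg layer):

```python
import numpy as np

def passthrough(features, stride=2):
    """Space-to-depth: stack each stride x stride block of spatial
    neighbors into the channel dimension."""
    h, w, c = features.shape
    assert h % stride == 0 and w % stride == 0
    out = features.reshape(h // stride, stride, w // stride, stride, c)
    out = out.transpose(0, 2, 1, 3, 4)  # group the block offsets next to the channels
    return out.reshape(h // stride, w // stride, stride * stride * c)

x = np.zeros((26, 26, 512))
print(passthrough(x).shape)  # (13, 13, 2048)
```

Halving each spatial dimension quadruples the channel count, which is why 26 x 26 x 512 becomes 13 x 13 x 2048 and can be concatenated with the 13 x 13 detection features.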

The detector operates on this augmented feature map, gaining access to detailed information that aids in the localization of smaller objects. This modification yields a modest performance increase of approximately 1%.

Discussion