【ML Paper】YOLOv2: part10
This time, I'll introduce YOLOv2, following the paper by Joseph Redmon and Ali Farhadi. Let's focus on how it differs from YOLOv1.
This article is part 10. Part 9 is here.
Original Paper: https://arxiv.org/abs/1612.08242
Joint Training Approach
The authors introduce a joint training mechanism that leverages both classification and detection datasets to enhance object detection capabilities. Images labeled for detection are used to learn detection-specific information, such as bounding box coordinate prediction and objectness scores, as well as how to classify common objects.
Additionally, images with only class labels are incorporated to broaden the range of detectable categories. During training, images from detection and classification datasets are interleaved.
When the network processes a detection-labeled image, it backpropagates using the complete YOLOv2 loss function. Conversely, for classification-labeled images, backpropagation is restricted to the classification-specific components of the architecture.
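The rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the term names `coord`, `objectness`, and `class`, the `joint_loss` and `mixed_stream` helpers, and the simple shuffle used to mix the two datasets are all assumptions made for the example.

```python
import random

def joint_loss(terms, has_boxes):
    """Pick which loss terms contribute for one image.

    `terms` is a hypothetical dict of already-computed loss components.
    """
    if has_boxes:
        # Detection-labeled image: every part of the loss contributes.
        return terms["coord"] + terms["objectness"] + terms["class"]
    # Classification-labeled image: only the classification term contributes.
    return terms["class"]

def mixed_stream(detection_images, classification_images, seed=0):
    """Yield (image, has_boxes) pairs from both datasets in shuffled order."""
    tagged = [(img, True) for img in detection_images]
    tagged += [(img, False) for img in classification_images]
    random.Random(seed).shuffle(tagged)  # assumed mixing strategy
    yield from tagged

# Illustrative values only.
terms = {"coord": 0.5, "objectness": 0.25, "class": 1.0}
stream = list(mixed_stream(["det_1", "det_2"], ["cls_1"]))
```

With these illustrative numbers, a detection image contributes all three terms, while a classification image contributes only the `class` term.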
Label Merging Strategy
A critical aspect of the method is the coherent integration of labels from both datasets.
Detection datasets typically include general labels like "dog" or "boat," whereas classification datasets, such as ImageNet, offer a more granular taxonomy with over a hundred dog breeds, including "Norfolk terrier," "Yorkshire terrier," and "Bedlington terrier."
To effectively train on both datasets, it is essential to merge these labels in a manner that accommodates their hierarchical and non-mutually exclusive nature.
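To make the hierarchical relationship concrete, here is a small sketch. The `PARENT` map is a toy hierarchy invented for illustration (it is not taken from the paper or from any dataset), and `label_path` and `is_a` are hypothetical helpers.

```python
# Hypothetical child -> parent map; a toy hierarchy for illustration only.
PARENT = {
    "Norfolk terrier": "terrier",
    "Yorkshire terrier": "terrier",
    "Bedlington terrier": "terrier",
    "terrier": "dog",
}

def label_path(label):
    """Return the label followed by all of its ancestors, most specific first."""
    path = [label]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def is_a(label, general):
    """True if `general` is the label itself or one of its ancestors."""
    return general in label_path(label)
```

Under this toy map, an image labeled "Norfolk terrier" also counts as a "dog", which is why the two labels cannot be treated as competing alternatives.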
Challenges
One of the primary challenges in this joint training approach is the incompatibility of label structures between classification and detection datasets.
Classification models often employ a softmax layer across all possible categories to determine the final probability distribution, inherently assuming that classes are mutually exclusive. This assumption complicates the integration of datasets like ImageNet and COCO, where labels such as "Norfolk terrier" and "dog" overlap and are not mutually exclusive.
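A quick numeric sketch shows why a single softmax is a poor fit here. The logits are made-up values, chosen so that the network is equally confident about "dog" and "Norfolk terrier".

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["dog", "Norfolk terrier", "boat"]
# Made-up logits: strong evidence for both "dog" and "Norfolk terrier".
probs = softmax([4.0, 4.0, -2.0])
```

Because the probabilities must sum to 1, the two overlapping labels split the mass and neither can exceed 0.5, even though both are correct for the same image.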
An alternative is to adopt a multi-label model that does not enforce mutual exclusivity. However, this strategy disregards the inherent structure of the data, such as the mutual exclusivity of COCO classes, potentially undermining the model's performance.
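The trade-off can be seen with independent sigmoid outputs, one common way to build a multi-label model (assumed here for illustration; the paper does not prescribe this form). The logits are made-up values.

```python
import math

def sigmoid(x):
    """Logistic function mapping a logit to an independent probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Made-up logits: each label is scored independently of the others.
logits = {"dog": 4.0, "Norfolk terrier": 4.0, "boat": 4.0}
scores = {label: sigmoid(value) for label, value in logits.items()}
```

Each score can be high on its own, so "dog" and "Norfolk terrier" no longer compete. The same freedom lets "dog" and "boat" both score high for one object, since nothing encodes that those two classes exclude each other.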
Proposed Solution
To address these challenges, the method avoids a single flat softmax over all categories. Instead, it organizes the labels hierarchically (the paper calls this structure WordTree), so that detailed and general labels can coexist without being treated as mutually exclusive.
This enables the model to recognize both specific categories from classification datasets and general categories from detection datasets, thereby expanding the range of detectable objects while maintaining the structural integrity of the data.
Discussion