【ML Paper】YOLOv2: part12
This time, I'll introduce YOLOv2, based on the paper by Joseph Redmon and Ali Farhadi. Let's focus on the differences from YOLOv1.
This article is part 12. Part 11 is here.
Original Paper: https://arxiv.org/abs/1612.08242
Dataset Combination with WordTree
WordTree facilitates the integration of multiple datasets by mapping their categories to synsets within the tree structure.
For instance, labels from ImageNet and COCO are combined seamlessly using WordTree, as illustrated in Figure 6. Given the extensive diversity of WordNet, this method is applicable to a wide range of datasets, enabling a unified framework for category representation.
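The idea of mapping every dataset label onto a synset in one shared tree can be sketched as follows. The tree below is a tiny hypothetical fragment (the real WordTree is built from WordNet and is far larger); node names are illustrative only.

```python
# Minimal sketch of merging labels from two datasets onto a shared WordTree.
# The tree and the example labels are hypothetical, not the paper's actual tree.

# Each node maps to its parent synset; the root ("physical object") has parent None.
PARENT = {
    "physical object": None,
    "animal": "physical object",
    "dog": "animal",
    "terrier": "dog",
    "Norfolk terrier": "terrier",
}

def path_to_root(label):
    """Return the chain of synsets from a label up to the root."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path

# A coarse COCO-style label ("dog") and a fine-grained ImageNet-style label
# ("Norfolk terrier") resolve to consistent paths in the same tree:
print(path_to_root("dog"))              # ['dog', 'animal', 'physical object']
print(path_to_root("Norfolk terrier"))
```

Because both datasets' labels live on the same tree, a fine-grained ImageNet label is automatically consistent with the coarser COCO label on its ancestor path.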
Joint Classification and Detection
Leveraging WordTree to merge datasets allows for the training of a joint model that handles both classification and detection tasks.
An extensive detector is trained using a combined dataset comprising the COCO detection dataset and the top 9000 classes from the complete ImageNet release. To ensure comprehensive evaluation, additional classes from the ImageNet detection challenge, not previously included, are incorporated. The resulting WordTree for this dataset encompasses 9418 classes. To balance the significantly larger ImageNet dataset, COCO is oversampled, maintaining a ratio where ImageNet is only four times larger.
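The oversampling arithmetic can be sketched like this. The dataset sizes below are round illustrative numbers, not the actual sizes; only the 4:1 target ratio comes from the paper.

```python
import math

# Hedged sketch: repeat the smaller COCO set so that ImageNet ends up
# only about 4x larger. Sizes are illustrative assumptions.
coco_size = 120_000
imagenet_size = 9_000_000
target_ratio = 4  # ImageNet should be only ~4x the oversampled COCO

# Repeat each COCO image enough times to reach imagenet_size / target_ratio.
repeat_factor = math.ceil(imagenet_size / (target_ratio * coco_size))
effective_coco = coco_size * repeat_factor
print(repeat_factor, imagenet_size / effective_coco)
```

In practice the same effect can be achieved with a weighted sampler rather than literal duplication; the point is only that COCO examples are drawn far more often than their raw share of the combined data.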
Training YOLO9000
YOLO9000 is trained on the combined dataset using the YOLOv2 architecture, modified to include only three priors instead of five to reduce output size.
During training, when the network processes a detection image, loss is backpropagated as usual. For classification loss, only errors at or above the label's corresponding level in the tree are backpropagated. For example, if the label is "dog," no error is assigned to more specific predictions such as "German Shepherd" or "Golden Retriever," because that finer-grained information is unavailable.
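This "label level and above" rule can be sketched by collecting the label and its ancestors as the only nodes that receive an error signal. The tree here is a toy example (the real WordTree has 9418 nodes).

```python
# Sketch of restricting classification loss to the label's level and above.
# The tree and node names are toy examples, not the actual WordTree.
PARENT = {"animal": None, "dog": "animal", "German Shepherd": "dog"}

def active_nodes(label):
    """Nodes that receive classification loss for this label:
    the label itself and all its ancestors -- never its descendants."""
    nodes = set()
    while label is not None:
        nodes.add(label)
        label = PARENT[label]
    return nodes

# For a "dog" label, the finer "German Shepherd" node gets no error signal:
print(active_nodes("dog"))  # contains 'dog' and 'animal' only
```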
Conversely, when processing a classification image, only the classification loss is backpropagated. The bounding box that predicts the highest probability for the class is identified, and the classification loss is computed on its predicted tree. The predicted box is also assumed to overlap the (unlabeled) ground truth by at least 0.3 Intersection over Union (IOU), and objectness loss is backpropagated under this assumption.
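The selection of the best box and the loss bookkeeping for a classification image can be sketched as below. The box format, the `assumed_iou` field, and all numbers are illustrative assumptions; no ground-truth box actually exists for classification images.

```python
# Hedged sketch of the classification-image update: pick the box that predicts
# the label with the highest probability, always apply classification loss to
# it, and apply objectness loss only under the assumed IOU >= 0.3 overlap.

def classification_update(boxes, label, iou_threshold=0.3):
    """boxes: list of dicts with per-class probabilities and an assumed IOU
    with the (unlabeled) ground truth. Returns the chosen box and which
    losses would be backpropagated -- a simplification for illustration."""
    best = max(boxes, key=lambda b: b["probs"].get(label, 0.0))
    losses = {"classification": True}  # always applied to the best box
    losses["objectness"] = best["assumed_iou"] >= iou_threshold
    return best, losses

boxes = [
    {"probs": {"dog": 0.2}, "assumed_iou": 0.5},
    {"probs": {"dog": 0.9}, "assumed_iou": 0.4},
]
best, losses = classification_update(boxes, "dog")
print(best["probs"]["dog"], losses)  # 0.9 {'classification': True, 'objectness': True}
```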
Evaluation on ImageNet Detection
YOLO9000's performance is assessed on the ImageNet detection task, which shares 44 object categories with COCO. This means that for the majority of test images, YOLO9000 has only encountered classification data rather than detection data.
The model achieves an overall mean Average Precision (mAP) of 19.7, with a mAP of 16.0 on the 156 disjoint object classes that lack labeled detection data. This performance surpasses that of Deformable Parts Models (DPM), despite YOLO9000 being trained on different datasets with partial supervision. Additionally, YOLO9000 is capable of detecting 9000 other object categories in real-time.
Performance Analysis
Analyzing YOLO9000's performance on ImageNet reveals that the model effectively learns new animal species but faces challenges with categories related to clothing and equipment.
The successful learning of new animals is attributed to the generalization of objectness predictions from the animals present in COCO. In contrast, categories such as "sunglasses" or "swimming trunks" are problematic because COCO does not provide bounding box labels for clothing items, limiting YOLO9000's ability to model these categories accurately.
Discussion