
【ML Paper】YOLOv2: part8

Published on 2024/11/08

This time, I'll introduce YOLOv2, based on the paper by Joseph Redmon and Ali Farhadi. Let's focus on the differences from YOLOv1.

This article is part 8. Part 7 is here.

Original Paper: https://arxiv.org/abs/1612.08242

Faster

YOLOv2 is engineered for high-speed detection, crucial for applications such as robotics and self-driving cars that require low-latency predictions.

Traditional detection frameworks typically utilize VGG-16 as the base feature extractor, which, despite its robust classification capabilities, demands approximately $3.069 \times 10^{10}$ floating-point operations for processing a single image at a resolution of $224 \times 224$.

In contrast, YOLO employs a custom network inspired by the GoogLeNet architecture, reducing the computational load to $8.52 \times 10^{9}$ operations per forward pass.
This modification results in a slight decrease in accuracy, with YOLO achieving an 88.0% top-5 accuracy on ImageNet compared to VGG-16’s 90.0%, while significantly enhancing processing speed.
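
As a sanity check on these numbers, the $3.069 \times 10^{10}$ figure can be reproduced by counting two operations (one multiply, one add) per multiply-accumulate across VGG-16's thirteen convolutional layers. The sketch below makes that assumption explicit; the layer list is the standard VGG-16 configuration, and the fully connected layers are excluded, which appears to be how the paper arrives at its count:

```python
def conv_flops(hw, c_in, c_out, k=3):
    # Cost of one conv layer, counting multiply and add separately
    # (2 ops per multiply-accumulate). This counting convention is
    # an assumption on my part, not something the paper states.
    return 2 * hw * hw * c_in * c_out * k * k

# Standard VGG-16 conv layers as (output resolution, in channels, out channels);
# the three fully connected layers are deliberately left out.
vgg16_convs = [
    (224, 3, 64), (224, 64, 64),
    (112, 64, 128), (112, 128, 128),
    (56, 128, 256), (56, 256, 256), (56, 256, 256),
    (28, 256, 512), (28, 512, 512), (28, 512, 512),
    (14, 512, 512), (14, 512, 512), (14, 512, 512),
]

total = sum(conv_flops(*layer) for layer in vgg16_convs)
print(f"{total:.3e}")  # 3.069e+10, matching the figure quoted above
```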

Darknet-19

To further optimize performance, YOLOv2 introduces Darknet-19 as its new classification model. Darknet-19 consists of 19 convolutional layers and 5 max-pooling layers, leveraging primarily $3 \times 3$ filters and doubling the number of channels after each pooling step, similar to VGG architectures.
Incorporating concepts from Network in Network (NIN), Darknet-19 utilizes global average pooling for predictions and integrates $1 \times 1$ filters to compress feature representations between $3 \times 3$ convolutions. Additionally, batch normalization is employed to stabilize training, accelerate convergence, and regularize the model.

This streamlined architecture requires only $5.58 \times 10^{9}$ floating-point operations to process an image, achieving 72.9% top-1 accuracy and 91.2% top-5 accuracy on the ImageNet dataset, thereby balancing computational efficiency with commendable classification performance.
For the full layer-by-layer configuration, see Table 6 in the paper.
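
To make the layer pattern concrete, here is a minimal PyTorch sketch of Darknet-19 as described above: $3 \times 3$ convolutions whose channel count doubles after each of the five max-pools, $1 \times 1$ compression layers in between, batch normalization on every convolution, and a $1 \times 1$ classifier head followed by global average pooling. The 0.1 leaky-ReLU slope is an assumption carried over from the reference Darknet code, not something this part of the paper specifies:

```python
import torch
import torch.nn as nn

def conv_bn(c_in, c_out, k):
    # Darknet-style block: convolution + batch norm + leaky ReLU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),  # slope assumed from the Darknet code
    )

class Darknet19(nn.Module):
    """19 conv layers and 5 max-pooling layers, per the description above."""
    def __init__(self, num_classes=1000):
        super().__init__()
        pool = nn.MaxPool2d(2, 2)
        self.features = nn.Sequential(
            conv_bn(3, 32, 3), pool,                    # conv 1
            conv_bn(32, 64, 3), pool,                   # conv 2
            conv_bn(64, 128, 3), conv_bn(128, 64, 1),
            conv_bn(64, 128, 3), pool,                  # convs 3-5
            conv_bn(128, 256, 3), conv_bn(256, 128, 1),
            conv_bn(128, 256, 3), pool,                 # convs 6-8
            conv_bn(256, 512, 3), conv_bn(512, 256, 1),
            conv_bn(256, 512, 3), conv_bn(512, 256, 1),
            conv_bn(256, 512, 3), pool,                 # convs 9-13
            conv_bn(512, 1024, 3), conv_bn(1024, 512, 1),
            conv_bn(512, 1024, 3), conv_bn(1024, 512, 1),
            conv_bn(512, 1024, 3),                      # convs 14-18
        )
        # NIN-style head: 1x1 conv to class scores (conv layer 19).
        self.classifier = nn.Conv2d(1024, num_classes, 1)

    def forward(self, x):
        x = self.classifier(self.features(x))
        return x.mean(dim=(2, 3))  # global average pooling over H and W

print(Darknet19()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```

Counting two operations per multiply-accumulate over these 19 convolutional layers, the same way as in the VGG-16 estimate above, gives roughly $5.58 \times 10^{9}$, consistent with the figure quoted from the paper.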

Discussion