【ML Paper】YOLOv2: part8

This time, I'll introduce YOLOv2, following the paper by Joseph Redmon and Ali Farhadi. Let's focus on the differences from YOLOv1.

This article is part 8. Part 7 is here.

Original Paper: https://arxiv.org/abs/1612.08242

Faster

YOLOv2 is engineered for high-speed detection, crucial for applications such as robotics and self-driving cars that require low-latency predictions.

Traditional detection frameworks typically utilize VGG-16 as the base feature extractor, which, despite its robust classification capabilities, demands approximately $3.069 \times 10^{10}$ floating-point operations to process a single image at a resolution of $224 \times 224$.
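To see where figures like this come from, here is a minimal sketch of per-layer FLOP counting. The helper name `conv_flops` is ours, not from the paper; the factor of 2 counts one multiplication and one addition per weight, a common (but not universal) convention.

```python
def conv_flops(h_out, w_out, c_in, c_out, k):
    """FLOPs for one convolutional layer: each of the h_out * w_out * c_out
    output elements needs c_in * k * k multiply-add pairs."""
    return 2 * h_out * w_out * c_out * c_in * k * k

# Example: the first 3x3 convolution of VGG-16 on a 224x224 RGB image
# (padding keeps the spatial size, so the output is 224x224x64).
flops = conv_flops(224, 224, 3, 64, 3)
print(f"{flops:.3e}")  # about 1.7e8 FLOPs for this single layer
```

Summing such counts over every layer of a network yields the whole-network totals compared in the text.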

In contrast, YOLO employs a custom network inspired by the GoogLeNet architecture, reducing the computational load to $8.52 \times 10^{9}$ operations per forward pass.
This modification results in a slight decrease in accuracy, with YOLO achieving an 88.0% top-5 accuracy on ImageNet compared to VGG-16’s 90.0%, while significantly enhancing processing speed.

Darknet-19

To further optimize performance, YOLOv2 introduces Darknet-19 as its new classification model. Darknet-19 consists of 19 convolutional layers and 5 max-pooling layers, leveraging primarily $3 \times 3$ filters and doubling the number of channels after each pooling step, similar to VGG architectures.
Incorporating concepts from Network in Network (NIN), Darknet-19 utilizes global average pooling for predictions and integrates $1 \times 1$ filters to compress feature representations between $3 \times 3$ convolutions. Additionally, batch normalization is employed to stabilize training, accelerate convergence, and regularize the model.
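The architecture described above can be written down as a plain layer table. The listing below is transcribed from Table 6 of the paper; the tuple encoding is an ad hoc sketch, not the Darknet config format.

```python
# Darknet-19 layer list: ("conv", out_channels, kernel_size) or ("max",).
# Channels double after each max-pooling step, and 1x1 layers halve them
# between 3x3 convolutions to compress the feature maps.
DARKNET19 = [
    ("conv", 32, 3), ("max",),
    ("conv", 64, 3), ("max",),
    ("conv", 128, 3), ("conv", 64, 1), ("conv", 128, 3), ("max",),
    ("conv", 256, 3), ("conv", 128, 1), ("conv", 256, 3), ("max",),
    ("conv", 512, 3), ("conv", 256, 1), ("conv", 512, 3),
    ("conv", 256, 1), ("conv", 512, 3), ("max",),
    ("conv", 1024, 3), ("conv", 512, 1), ("conv", 1024, 3),
    ("conv", 512, 1), ("conv", 1024, 3),
    ("conv", 1000, 1),  # classifier head; followed by global avg pool + softmax
]

n_conv = sum(1 for layer in DARKNET19 if layer[0] == "conv")
n_pool = sum(1 for layer in DARKNET19 if layer[0] == "max")
print(n_conv, n_pool)  # 19 convolutional layers, 5 max-pooling layers
```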

This streamlined architecture requires only $5.58 \times 10^{9}$ floating-point operations to process an image, achieving 72.9% top-1 accuracy and 91.2% top-5 accuracy on the ImageNet dataset, thereby balancing computational efficiency with commendable classification performance.
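The global-average-pooling head mentioned above is simple enough to show directly. This is a NumPy sketch under the assumption of NCHW tensors; the function name is ours. With five stride-2 max-pooling steps, a $224 \times 224$ input leaves a $7 \times 7$ grid, so the final $1 \times 1$ convolution's 1000 channels average down to 1000 class logits.

```python
import numpy as np

def global_avg_pool(x):
    """Collapse each channel's spatial map to its mean: an (N, C, H, W)
    feature tensor becomes an (N, C) vector, which serves directly as
    class scores when C equals the number of classes."""
    return x.mean(axis=(2, 3))

# Simulated Darknet-19 final feature map: 1000 channels on a 7x7 grid.
feats = np.random.rand(1, 1000, 7, 7)
logits = global_avg_pool(feats)
print(logits.shape)  # (1, 1000)
```

Replacing fully connected layers with this pooling step removes a large block of parameters, which is part of how Darknet-19 keeps its operation count low.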
For a full description see Table 6.

Discussion