
[English ver.] CNN < Transformer ?

Published on 2024/12/11

This article is the eighth day of the CyberAgent Developers Advent Calendar 2024. It is also a continuation of "The trick of refining the Detection Transformer for Free Input Resolution".


I am Hyodo, a research engineer engaged in the research and development of interactive agents and robots in the Agent Development Team at CyberAgent AI Lab. As a Developer Expert at CyberAgent, I support not only the Lab's research activities but also the company's businesses as a whole through technology.

Since packing in every detail of a project always leads to articles exceeding several hundred thousand words and breaking the blogging system's back end, this time I will leave out all demo code and the like and summarize a year's worth of work as compactly as possible, letting more of the substance sit between the lines.

1. Tasks worked on

In order to understand and track the behavior of a “person” from the perspective of a 2D agent or robot, a fast and accurate object detection model on the order of single-digit milliseconds is sometimes required. This time, I generated a model by training RT-DETRv2 [1], a recently released Detection Transformer model, on our own dataset. First, please take a look at the sample detection results to get a sense of the generated model's performance.

To estimate the state of a person facing the robot in real time, I generated an object detection model for simple tracking that acquires information on the whole body, head, face, eyes, nose, mouth, ears, right hand, left hand, feet, attributes, and face orientation in eight directions. The model must be lightweight and robust to every kind of environmental noise (motion blur, defocus blur, backlighting, halation, low light, and distances from very near to very far), since it is intended to run as edge processing in the real world.

(Sample detection results)

I have no intention of being biased toward any particular architecture just because it is a CNN or because it is a Transformer; I will actively use whichever works best. With that in mind, I decided to benchmark the Transformer, which is generally regarded as computationally heavy but highly accurate, to see how well it holds up in real-world use.

2. Problems with existing public datasets

There are a variety of public datasets that could be used to train the model, but the problems I found by actually visually inspecting several of them are listed below. The problems are largely shared among the datasets, although no single dataset has all of them.

  1. Too much data
  2. Too little variation in backgrounds
  3. Few variations of noise
  4. Few variations of distance
  5. Few variations of tilt angle and elevation angle
  6. Small objects are not labeled
  7. Wrong labels assigned
  8. For object detection labels, the label position is far from the correct position
  9. Labels assigned to places where no object exists
  10. Labels are not assigned with consideration for features that are useful in practical implementations

One can easily imagine that a model trained on such low-quality datasets cannot have its true performance assessed properly, since the validation results themselves do not reflect it. For that reason, I am not at all interested in the apparent benchmarks in the various papers.

3. Dataset creation with self-annotation

In order to compare the performance of the pure CNN and Transformer architectures, I decided to fully guarantee the quality of the public dataset by having a single person manually re-annotate all images in COCO-Hand [2], a subset of MS-COCO. Even now, I will push through with single-modal images only. My re-annotation policy is as follows.

  1. Annotate every object that can be visually identified, without omission.
  2. Annotate down to 3 x 3 pixels, the smallest size that the annotation tool used (CVAT [3]) can handle.
  3. Do not worry about bias in the number of annotations per class.
  4. Be aware of covering the three sizes Small (< 32 x 32), Medium (32 x 32 to 96 x 96), and Large (> 96 x 96) as much as possible (this depends on the image set; see the sketch after this list).
  5. Do not allow any unnecessary margin at the pixel level.
  6. Do not allow any annotation that falls short of the object boundary at the pixel level.
  7. Always annotate, even in the presence of strong motion blur, halation, or darkness.
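For reference, the Small / Medium / Large split in item 4 follows the usual COCO-style thresholds. Below is a minimal sketch of how a box could be bucketed, interpreting the thresholds as areas; the function itself is illustrative and not part of my actual tooling.

```python
# Bucket an annotation into the COCO-style size categories from policy item 4,
# interpreting the thresholds as areas (32*32 = 1024 px^2, 96*96 = 9216 px^2).
def size_category(width: float, height: float) -> str:
    area = width * height
    if area < 32 * 32:
        return "small"
    if area <= 96 * 96:
        return "medium"
    return "large"

print(size_category(3, 3))     # "small"  -- the minimum size CVAT lets me draw
print(size_category(48, 48))   # "medium"
print(size_category(128, 96))  # "large"
```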

Although it does not cover every pattern, I created a 57-second video at 60 FPS showing actual annotation work performed in strict compliance with the above criteria. The image used contains almost no noise, but the target objects are very small, which generally makes annotation very difficult. I applied 2,611 annotations to a single 480x360 image. To make the video easier to view, the image is enlarged to twice its size (960x720), but the actual annotations were applied at half that size. Note that, due to the responsive design, the video may not be displayed at its maximum size in some environments.

https://youtu.be/ilbHznpE3q0

The annotated image at actual size is shown below. Viewing actual-size images in a PC browser is recommended.

The following figure shows the classes annotated across all image sets and the number of annotations for each class. In total, 530,268 annotations were applied manually to 12,186 images over the course of a year. One might ask why I did not simply auto-annotate the images with a popular VLM or ViT (e.g. DINOv2 [4]); the answer is that their performance is far too low to be useful. It is better to face reality.

The presentation that summarizes the essence of annotation will be published on the Lab blog at a later date, so please check back then.

4. Detection Transformer performance

Now, let's look at the detection performance of the Transformer. The following is the detection performance of RT-DETRv2 trained on the dataset created in the previous chapter.

The preconditions are as follows:

  1. Test images taken from Face-Detector-1MB-with-landmark [5]
  2. Resolution of the test images: 1600 x 898
  3. The RT-DETRv2 variant with the highest detection performance is used: X size, 1,250 queries in the model, internal processing resolution of 640 x 640
  4. The model was trained on 25 classes [6] covering each part of the body, including faces
  • Input image

  • Output results
    It seems to detect most objects correctly. However, it does not perform as well as I expected.

5. CNN vs Detection Transformer

Now, let's try a bold comparison. It is often thought that Transformer has higher base performance than CNN, but what happens when I actually compare them?

  • 1MB CNN vs RT-DETRv2-X
    The output of the CNN, whose file size is only 1MB, is on the left, and the output of the Transformer, whose file size is 300MB, is on the right. Note that the CNN is a model that supports free input resolutions, whereas RT-DETRv2 rescales internally to 640×640. Even so, it is clear at a glance that the Transformer's detection power is inferior to that of the CNN. The key point to look at is whether it can detect the faces of the people in the background.
1MB CNN (left) / RT-DETRv2-X, 1,250 queries (right)

  • YOLOv9-E vs RT-DETRv2-X
    The most powerful CNN, YOLOv9-E at 1280×736 [7] [8], trained on the same dataset as RT-DETRv2, is on the left, and RT-DETRv2 at 640×640 is on the right. The score threshold is set to the same value for both, but note that the CNN side includes NMS in its post-processing while the RT-DETRv2 side does not. Not needing NMS is the original selling point of the Detection Transformer design, but from the perspective of versatility the results are quite negative. I will examine the impact of aspect ratio on the model's internal processing later. Incidentally, the Train and Val sets of YOLOv9-E and RT-DETRv2 are completely identical, with no difference in the number of images.
YOLOv9-E (left) / RT-DETRv2-X, 1,250 queries (right)

It may look like over-detection, but the situation is not that simple. I understand why one might think so, but I recommend actually trying it out with the model and sample images in my repository. After filtering the YOLOv9 output with NMS, the final bounding box count was 2,778. As an example, I cropped close-ups of heads that appear in the far distance. The head of the person on the poster on the wall, roughly 100m away, and the head of the person behind the glass are also detected. At this distance, the object is almost a point.
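For readers unfamiliar with this filtering step, the sketch below shows score filtering plus per-class NMS over raw detections and how the surviving boxes are counted. It is only illustrative: in my actual pipeline the NMS is merged into the ONNX graph, and the tensor layout and threshold values here are assumptions.

```python
import torch
from torchvision.ops import nms

def filter_detections(boxes, scores, class_ids, score_thr=0.35, iou_thr=0.45):
    """Score filtering followed by per-class NMS over raw detections.

    boxes:     [N, 4] float tensor in (x1, y1, x2, y2) format
    scores:    [N]    float tensor of confidences
    class_ids: [N]    long tensor of class indices
    Returns the indices of the surviving boxes.
    """
    keep = scores > score_thr
    boxes, scores, class_ids = boxes[keep], scores[keep], class_ids[keep]

    kept = []
    for c in class_ids.unique():
        idx = (class_ids == c).nonzero(as_tuple=True)[0]
        kept.append(idx[nms(boxes[idx], scores[idx], iou_thr)])
    return torch.cat(kept) if kept else torch.empty(0, dtype=torch.long)

# len(filter_detections(boxes, scores, class_ids)) is the final box count --
# 2,778 for the crowded test image discussed above.
```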

Since comparisons are not fair when the training conditions differ, just in case, I list below the training conditions of YOLOv9 and RT-DETRv2 (excluding augmentation settings with minimal impact) together with the mAP after training. The point here is that both models use the same image set and are trained with images resized to 640x640. In other words, when processing the test image shown above, RT-DETRv2 clearly works at a lower effective resolution. And most importantly, it is a mistake to judge operational performance simply from high or low mAP values.

|  | YOLOv9-E | RT-DETRv2-X, 1,250 queries |
| --- | --- | --- |
| Resolution during training | 640×640 | 640×640 |
| Resolution during validation | 640×640 | 640×640 |
| mAP after training | 60.7 | 65.0 |
| Params | 57.3M | 76.0M |
| FLOPs | 189.0G | 259.0G |

The benchmark on MS-COCO described in the RT-DETRv2 paper is shown below. It is certainly superior to the CNNs in terms of both accuracy and speed, and the same trend appears in the relative comparison of my own benchmark above. YOLOv9 is not listed as a comparison target in the figure, but that is not a mistake, since RT-DETRv2 has higher numerical performance. Still, something does not feel quite right.


(Cited: figure from the RT-DETRv2 paper)

6. Performance changes according to the number of queries in Transformer

The RT-DETRv2 model used in the comparison above was trained with an input resolution of 640 x 640 and 1,250 queries. Because the Transformer can only process 1,250 queries, in other words it can only detect up to 1,250 instances, this limitation may have had a significant impact. Just to be sure, I redefined the model with double the number of queries, 2,500, and compared the results. There are at least several hundred people in the image. As it turns out, increasing the number of queries seems to exceed the capacity of the hidden layers and the like, and actually has a negative impact.
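To make the query limitation concrete: a DETR-style decoder emits a fixed set of num_queries predictions per image, so the number of detectable instances is hard-capped at that value. The following is only a schematic shape sketch, not RT-DETRv2's actual code, and the threshold is illustrative.

```python
import torch

batch, num_queries, num_classes = 1, 1250, 25

# A DETR-style decoder head emits a fixed-size set of predictions per image:
class_logits = torch.randn(batch, num_queries, num_classes)  # one class vector per query
pred_boxes   = torch.rand(batch, num_queries, 4)             # one box per query (cx, cy, w, h)

# Even after score filtering, at most num_queries boxes can survive per image,
# which matters for scenes containing several hundred people.
scores, labels = class_logits.sigmoid().max(dim=-1)
keep = scores > 0.35                                         # illustrative threshold
print(int(keep.sum()), "<=", num_queries)
```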

YOLOv9-E / RT-DETRv2-X, 2,500 queries

7. Performance changes due to changes in aspect ratio

It is not fair to compare a CNN, which can handle any resolution, with a Transformer, which can only handle the resolution specified during training. So what happens if I divide the image into left and right halves so that the aspect ratio fed to the Transformer is almost 1:1, giving roughly square inputs of 800x898? The results are shown in the figures below.

RT-DETRv2-X, 2,500 queries, aspect ratio 1:1, Batch 1 / Batch 2

You can see that the detection accuracy clearly improves compared to inference with a distorted aspect ratio. The input image reduction ratio is 0.80x horizontally and 0.82x vertically for YOLOv9, and 0.80x horizontally and 0.71x vertically for RT-DETRv2. The comparison is still not fair, yet the result does not come close to the detection performance of YOLOv9-E. I would also like to retrain RT-DETRv2 at 1280 x 736 and compare, but the training cost (training time and required VRAM) is too high for a casual verification, and I judged that the performance gain would not be worth the cost, so I will skip that verification. It would probably require around 320 GB of VRAM.
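For reference, a minimal sketch of the left/right split used above is shown below. The file path, preprocessing details, and the commented-out inference call are placeholders rather than my exact pipeline (in practice the pre-processing is merged into the ONNX model).

```python
import cv2
import numpy as np

# Split the 1600x898 test image into two ~square halves (800x898 each),
# then resize each half to the model's fixed 640x640 input.
img = cv2.imread("test_1600x898.jpg")           # placeholder path
h, w = img.shape[:2]                            # 898, 1600
halves = [img[:, : w // 2], img[:, w // 2 :]]   # left half, right half

batch = []
for half in halves:
    x = cv2.resize(half, (640, 640))            # 0.80x horizontally, ~0.71x vertically
    x = x[..., ::-1].astype(np.float32) / 255.0 # BGR -> RGB, normalize (assumed preprocessing)
    batch.append(x.transpose(2, 0, 1))          # HWC -> CHW

batch = np.stack(batch)                         # shape: (2, 3, 640, 640)
# outputs = session.run(None, {"images": batch})  # placeholder input name / runtime call
```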

8. Inference speed

Finally, I will compare the models in terms of inference speed, under the same conditions as in the previous chapter. Note, therefore, that the computational cost on the Transformer side is significantly lower, i.e. it is compared under favorable conditions. In addition, the pre-processing has been merged into both the YOLOv9 and RT-DETRv2 models, and the NMS processing has been merged into the YOLOv9 model only, making it an end-to-end model.

  • Test environment
| Verification conditions | YOLOv9-E | RT-DETRv2-X, 1,250 queries |
| --- | --- | --- |
| Input resolution | 1280×736 | 640×640 |
| Inference runtime | onnxruntime-gpu (TensorRT EP) | onnxruntime-gpu (TensorRT EP) |
| Number of inferences | 10 | 10 |
| NMS | Yes | No |
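As a reference for how this kind of measurement can be set up with onnxruntime's TensorRT Execution Provider, here is a minimal sketch. The model path, input name, and warm-up handling are illustrative assumptions, not necessarily the exact procedure behind the numbers below.

```python
import time
import numpy as np
import onnxruntime as ort

providers = [
    "TensorrtExecutionProvider",   # TensorRT EP; falls back to the next provider if unavailable
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]
sess = ort.InferenceSession("rtdetrv2_x_640x640.onnx", providers=providers)  # placeholder path

inp = sess.get_inputs()[0]
x = np.random.rand(1, 3, 640, 640).astype(np.float32)  # dummy input at the model's resolution

sess.run(None, {inp.name: x})  # warm-up (TensorRT engine build happens on the first run)

runs = 10
start = time.perf_counter()
for _ in range(runs):
    sess.run(None, {inp.name: x})
elapsed_ms = (time.perf_counter() - start) * 1000 / runs
print(f"average inference time: {elapsed_ms:.2f} ms")
```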

The measurement results are as follows. The input resolution of YOLOv9 is significantly larger, so it is at a disadvantage from the start; still, considering that the CNN can flexibly adapt to input resolution scaling while maintaining accuracy, I think you can see that it offers high operational flexibility at a correspondingly higher computational cost. The inference time in the table below is the average time per inference over 10 inferences.

| YOLOv9-E | RT-DETRv2-X, 1,250 queries |
| --- | --- |
| 69.31 ms | 9.79 ms |

Let's change the input resolution on the YOLOv9 side to 640 x 640, disable NMS, and compare the results. The following are the conditions.

| Verification conditions | YOLOv9-E | RT-DETRv2-X, 1,250 queries |
| --- | --- | --- |
| Input resolution | 640×640 | 640×640 |
| Inference runtime | onnxruntime-gpu (TensorRT EP) | onnxruntime-gpu (TensorRT EP) |
| Number of inferences | 10 | 10 |
| NMS | No | No |

The following results were obtained. YOLOv9 is very fast without NMS.

| YOLOv9-E | RT-DETRv2-X, 1,250 queries |
| --- | --- |
| 10.20 ms | 9.79 ms |

Now, let's try activating NMS on the YOLOv9 side and comparing the results. The conditions are as follows.

| Verification conditions | YOLOv9-E | RT-DETRv2-X, 1,250 queries |
| --- | --- | --- |
| Input resolution | 640×640 | 640×640 |
| Inference runtime | onnxruntime-gpu (TensorRT EP) | onnxruntime-gpu (TensorRT EP) |
| Number of inferences | 10 | 10 |
| NMS | Yes | No |

The following results were obtained. Note that the NMS processing cost on the YOLOv9 side (roughly 20.17 - 10.20 ≈ 10 ms here) is about the same as the inference cost of the main model, and it grows as the number of detected objects increases.

| YOLOv9-E | RT-DETRv2-X, 1,250 queries |
| --- | --- |
| 20.17 ms | 9.79 ms |

9. Summary

After preparing the dataset to be as close to real-world conditions as possible, I compared the accuracy and speed of the CNN and the Transformer, and based on the results I do not think it is possible to say that CNN < Transformer. The comparison was made with the full understanding that it is not a fair one, since the input resolutions of the models differ greatly. Even so, it is very important to choose the appropriate architecture for the situation in which the model will actually be used. Although papers do not mention it, I think anyone who operates models in practice understands the importance of data quality and architecture selection. Finally, just to be sure, I will add that when comparing on images with relatively low resolution and fewer than 100 objects to detect, the detection performance of the Transformer is much higher. However, the Transformer is clearly weak at scaling and has low operational flexibility. The Transformer is great because it is fast and performs well, but I did not feel that Attention Is All You Need at all.

Craftsmanship will soon be wiped out. I know that, and history has shown us that. That's why I don't see any essential value in the “artistic” aspects, and I feel that the value lies in accumulating and scaling up “ultimate data” that will never become obsolete. I want to leave behind at least one thing that will make people say, “The Japanese are crazy.”

10. Appendix

11. Cited

[1] https://github.com/lyuwenyu/RT-DETR
[2] http://vision.cs.stonybrook.edu/~supreeth/COCO-Hand.zip
[3] https://github.com/cvat-ai/cvat
[4] https://github.com/facebookresearch/dinov2
[5] https://github.com/Linzaer/Ultra-Light-Fast-Generic-Face-Detector-1MB
[6] https://github.com/PINTO0309/PINTO_model_zoo/tree/main/460_RT-DETRv2-Wholebody25#2-annotation
[7] https://github.com/WongKinYiu/yolov9
[8] https://github.com/PINTO0309/PINTO_model_zoo/tree/main/459_YOLOv9-Wholebody25
