A Simple Framework for Open-Vocabulary Segmentation and Detection

Published on 2025/04/26
 A Simple Framework for Open-Vocabulary Segmentation and Detection (ICCV2023) [1]
The paper proposes Open-vocabulary Segmentation and Detection (OpenSeeD), a framework that learns segmentation and object detection jointly.

All figures in this post are taken from the paper [1].

 Background
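Figure (a): Object detection (OD) datasets cover many classes but provide only sparse spatial information (boxes), whereas segmentation (SG) datasets cover fewer classes but provide dense information (pixel-level masks). For example, Objects365 [2], a commonly used OD dataset, has annotations for 365 classes on about 1.7M images, while the COCO [3] mask annotations cover only 133 classes on 0.1M images.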

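Because of this gap between OD and SG in spatial granularity and in the number of classes covered, previous methods have focused on improving the performance of only one of the two tasks.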

Note: ITP stands for image-text pairs.
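Figure (b): Type 1 (Mask R-CNN [4]): the dataset must provide complete box + mask annotations.
Type 2 (Mask DINO [5]): the model is trained for object detection and then fine-tuned on a segmentation dataset. The resulting model is closed-set.
Type 3 (OpenSeeD, the proposed method): the model can also learn from data that does not carry both object detection and segmentation annotations, and it supports open-vocabulary recognition.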

 Proposed Method

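 Decoupled Foreground and Background Decoding
Segmentation has to recognize both foreground and background, whereas object detection only identifies the foreground. Using the same queries for both tasks therefore causes task interference and lowers performance.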

To address this, the paper proposes Decoupled Foreground and Background Decoding.

The queries that identify foreground and background are separated:

Q_f: Foreground queries

Q_b: Background queries
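From these queries, the model predicts the mask, bounding box, and class for the foreground and for the background respectively: <P_{f}^{m}, P_{f}^{b}, P_{f}^{c}> and <P_{b}^{m}, P_{b}^{b}, P_{b}^{c}>.
The loss used is therefore as follows.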



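Here, \mathbf{c} is the GT class, \mathbf{b} is the GT bounding box, \mathbf{m} is the GT mask, and \mathbf{\hat{b}} is the bounding box obtained from the predicted segmentation mask.
In addition, an open-vocabulary task has to distinguish an enormous number of text concepts, but the number of Q_f that can be used is limited. The paper therefore proposes language-guided foreground query selection.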

\mathbf{E}^{b} = \mathrm{Head}(\mathbf{O}), \quad \mathbf{E}^{c} = \mathrm{Sim}(\mathbf{O}, \mathbf{T})
Query selection is based on \mathbf{E}^{b}, produced by Head (the bounding box head), and on the top-k of the classification scores \mathbf{E}^{c}, computed from the image features \mathbf{O} and the text features \mathbf{T}.
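The selection step can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `select_foreground_queries`, the dot product used as Sim, and the choice to score each token by its most similar text embedding are all assumptions made for illustration.

```python
def dot(u, v):
    """Plain dot product, used here as a stand-in for Sim (assumption)."""
    return sum(a * b for a, b in zip(u, v))


def select_foreground_queries(image_feats, text_feats, boxes, k):
    """Pick the k image tokens whose best text similarity is highest.

    image_feats: list of N feature vectors (O in the text)
    text_feats:  list of C text embeddings (T in the text)
    boxes:       list of N box proposals (E^b), one per token
    """
    # E^c: similarity of every image token to every text embedding.
    scores = [[dot(o, t) for t in text_feats] for o in image_feats]
    # Score each token by its most similar text concept (assumption).
    best = [max(row) for row in scores]
    # Indices of the k highest-scoring tokens, best first.
    top = sorted(range(len(best)), key=lambda i: best[i], reverse=True)[:k]
    return [(i, boxes[i], best[i]) for i in top]


# Toy data: 4 image tokens, 2 text concepts.
image_feats = [[1.0, 0.0], [0.2, 0.1], [0.0, 0.9], [0.5, 0.5]]
text_feats = [[1.0, 0.0], [0.0, 1.0]]
boxes = [(0, 0, 10, 10), (5, 5, 6, 6), (20, 20, 40, 40), (1, 1, 3, 3)]

selected = select_foreground_queries(image_feats, text_feats, boxes, k=2)
for index, box, score in selected:
    print(index, box, score)
```

Only the tokens that match some text concept well become foreground queries, which keeps the number of Q_f small even when the vocabulary is large.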

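Note: query selection is also used in Deformable DETR [6], Efficient DETR [7], DINO [8], RT-DETR [9], and others; see those papers for details.
For Q_b, learnable queries are used instead, because the reference points would be spread widely and the number of categories is small.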

Note: for reference points, see Deformable DETR [6].

 Conditioned Mask Decoding
OpenSeeD aims to use both an object detection dataset and a segmentation dataset to improve the performance of both tasks.

However, while a bounding box can easily be derived from a segmentation mask, a segmentation mask cannot be derived from a bounding box.
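A small sketch makes this asymmetry concrete. The helper `mask_to_box` and the toy masks are hypothetical and written only for illustration; masks are plain lists of 0/1 rows.

```python
def mask_to_box(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) around a binary mask.

    mask is a list of rows containing 0/1; returns None for an empty mask.
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# Two different masks ...
mask_a = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
mask_b = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

# ... map to the same box, so the box alone cannot recover the mask.
box_a = mask_to_box(mask_a)
box_b = mask_to_box(mask_b)
print(box_a, box_b)
```

Going from mask to box is a deterministic reduction, while the reverse direction is ambiguous and has to be learned.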

Conditioned Mask Decoding is proposed to solve this.



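In Conditioned Mask Decoding, the model learns on the segmentation dataset to generate masks from GT bounding boxes and class labels. It then generates masks for the object detection dataset as well, conditioned on the boxes and labels, which supports mask learning.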



As shown in the figure above, the conditioned queries Q_d, which are generated from GT bounding boxes and class labels, do not perform self-attention with the other queries (Q_f, Q_b).
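One way to picture this separation is as a self-attention mask over the concatenated queries. The sketch below is an illustration under assumptions, not the paper's code: the function name, the query ordering [Q_f | Q_b | Q_d], the convention that `True` means "may attend", and the choice to let the Q_d queries attend to each other are all hypothetical.

```python
def build_self_attention_mask(n_fg, n_bg, n_cond):
    """Return allowed[i][j]: True if query i may attend to query j.

    Queries are assumed to be ordered as [Q_f | Q_b | Q_d].
    """
    n_main = n_fg + n_bg
    total = n_main + n_cond
    allowed = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            # Attention is allowed only within the same group:
            # both queries in (Q_f, Q_b), or both in Q_d (assumption).
            allowed[i][j] = (i < n_main) == (j < n_main)
    return allowed


# 2 foreground, 1 background, and 2 conditioned queries.
allowed = build_self_attention_mask(n_fg=2, n_bg=1, n_cond=2)
for row in allowed:
    print("".join("x" if a else "." for a in row))
```

The printed pattern is block-diagonal: the GT-derived information carried by Q_d never reaches Q_f or Q_b through self-attention.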
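The generated masks can be used in two ways.
Online Mask Assistance: when matching GT to queries, the IoU between the generated mask and the predicted mask is also included in the matching cost. However, no mask loss is used.
Offline Mask Assistance: after mask generation has been trained, pseudo mask labels are attached to the object detection data and used for training.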
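The Online Mask Assistance idea can be sketched as a matching cost that is lowered by mask IoU. This is a toy illustration under assumptions: `base_cost`, `mask_weight`, and the brute-force search over assignments are hypothetical choices made to keep the example small (DETR-style models typically use Hungarian matching instead). The mask term affects only the matching here, not the loss.

```python
from itertools import permutations


def mask_iou(a, b):
    """IoU of two binary masks given as equally sized lists of 0/1 rows."""
    inter = sum(x and y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x or y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0


def match(base_cost, pred_masks, generated_masks, mask_weight=1.0):
    """Assign each GT object to a distinct query at minimum total cost.

    base_cost[g][q] is the class + box cost of pairing GT g with query q;
    a higher IoU between the generated mask of g and the predicted mask
    of q lowers the cost of that pairing.
    """
    n_gt, n_q = len(generated_masks), len(pred_masks)
    cost = [
        [
            base_cost[g][q]
            - mask_weight * mask_iou(generated_masks[g], pred_masks[q])
            for q in range(n_q)
        ]
        for g in range(n_gt)
    ]
    # Brute force over assignments; fine for a toy example only.
    best = min(
        permutations(range(n_q), n_gt),
        key=lambda p: sum(cost[g][q] for g, q in enumerate(p)),
    )
    return list(best)


# Class + box costs are tied, so only the mask term can decide.
base_cost = [[0.5, 0.5], [0.5, 0.5]]
generated_masks = [[[1, 1], [0, 0]], [[0, 0], [1, 1]]]  # from GT box + label
pred_masks = [[[0, 0], [1, 1]], [[1, 1], [0, 0]]]       # from the queries

assignment = match(base_cost, pred_masks, generated_masks)
print(assignment)  # assignment[g] is the query index matched to GT g
```

In this example the box and class costs cannot distinguish the two queries, and the generated masks break the tie.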

 Results
The model is trained on a panoptic segmentation dataset (COCO) and an object detection dataset (Objects365).

Footnotes
[1] Zhang, Hao, et al. "A Simple Framework for Open-Vocabulary Segmentation and Detection." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[2] Shao, Shuai, et al. "Objects365: A Large-Scale, High-Quality Dataset for Object Detection." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
[3] Lin, Tsung-Yi, et al. "Microsoft COCO: Common Objects in Context." Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V. Springer International Publishing, 2014.
[4] He, Kaiming, et al. "Mask R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2017.
[5] Li, Feng, et al. "Mask DINO: Towards a Unified Transformer-Based Framework for Object Detection and Segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[6] Zhu, Xizhou, et al. "Deformable DETR: Deformable Transformers for End-to-End Object Detection." arXiv preprint arXiv:2010.04159 (2020).
[7] Yao, Zhuyu, et al. "Efficient DETR: Improving End-to-End Object Detector with Dense Prior." arXiv preprint arXiv:2104.01318 (2021).
[8] Zhang, Hao, et al. "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection." arXiv preprint arXiv:2203.03605 (2022).
[9] Zhao, Yian, et al. "DETRs Beat YOLOs on Real-Time Object Detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.