Learning Efficient and Adaptive Cross-Channel Dependencies for Weakly-Supervised Object Detection
Abstract: Recent progress in weakly-supervised object detection (WSOD) is featured by a combination of multiple instance detection networks (MIDN) and ordinal online refinement. However, since most WSOD methods use only image-level annotations, the serial stacking of convolutional blocks in MIDN cannot effectively model multi-channel information and often emphasizes only the most prominent parts of a target while ignoring the entire object, which degrades detection performance. In this paper, we investigate how to effectively exploit multi-channel information to improve the model's ability to capture long-range dependencies, and introduce CC-DETR (Cross-Channel DETR), a new weakly-supervised object detection framework. Specifically, we propose Cross-Channel Adaptive Convolution (CCAC), a module that captures spatial features at multiple scales, enlarges the receptive field, and adaptively weights each important feature to guide the model toward long-range dependencies. Moreover, we design a new attention mechanism called Dual-Stream Self-Attention (DSSA), which uses convolutions with adaptive kernel sizes to capture multi-scale information, preserving long-range dependencies while retaining local feature responses, thereby enhancing the model's ability to capture long-range dependencies. Extensive experiments demonstrate that our proposed method outperforms the current end-to-end state of the art (+2.3% mAP on VOC, +2.3% AP50 on COCO). Moreover, our method can be easily integrated into various DETR and ViT models with minimal modifications. The code will be available at https://github.com/cpy0029/CC-DETR.
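To make the CCAC idea concrete, the sketch below shows one plausible reading of the abstract's description: parallel convolutions at several kernel sizes capture multi-scale spatial features and enlarge the receptive field, and a learned per-channel gate adaptively re-weights the resulting branches. The branch layout, kernel sizes, and gating design here are illustrative assumptions, not the authors' exact module.

```python
# Minimal PyTorch sketch of a CCAC-style block (assumed design, not the paper's code).
import torch
import torch.nn as nn

class CCACSketch(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # Depthwise convolutions at multiple scales (assumed choice of kernels).
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        # Squeeze-and-excitation-style gate: one weight per branch and channel,
        # so important scales/channels are emphasised adaptively.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels * len(kernel_sizes), 1),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(channels, channels, 1)  # fuse back to the input width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [b(x) for b in self.branches]               # multi-scale responses
        weights = self.gate(x).chunk(len(feats), dim=1)     # per-branch channel gates
        fused = sum(w * f for w, f in zip(weights, feats))  # adaptive weighting
        return self.proj(fused) + x                         # residual fusion

# Usage: y = CCACSketch(256)(torch.randn(2, 256, 32, 32))
```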