Pseudo-label enhancement for weakly supervised object detection using self-supervised vision transformer

Kequan Yang, Yuanchen Wu, Jide Li, Chao Yin, Xiaoqiang Li

Published: 2025, Last Modified: 24 Oct 2025. Knowl. Based Syst. 2025. CC BY-SA 4.0

Abstract: Weakly supervised object detection (WSOD) using image-level labels has gained attention in the computer vision community. Most advanced WSOD approaches generate instance-level labels based on class activation maps (CAMs), which are limited by incomplete box labels and sparse information regarding class labels. These inaccurate pseudo-labels mislead the training of the subsequent detection module, thus hampering detection performance. To address these issues, we propose a novel pseudo-label enhancement (PLE) framework for a one-stage WSOD method that consists of a CAM reassembly (CAMR) module and a composite scoring module (CSM). The CAMR combines class priors from CAMs with the semantic-aware grouping ability of self-supervised vision transformers (ViTs). Specifically, CAMR first propagates object localization cues from CAMs to feature maps obtained using a self-supervised ViT. It then uses the patch affinity of these feature maps to extract more complete object information, improving the accuracy and completeness of instance localization. Finally, the reassembled CAMs are refined using a multi-threshold strategy to generate bounding-box pseudo-labels. The CSM replaces one-hot labels with soft class labels that combine localization precision and classification confidence to counter the sparseness of class-label information. The proposed PLE is evaluated on the PASCAL VOC and MS-COCO datasets. The experimental results demonstrate that PLE outperforms advanced methods. On average, PLE improves mAP by 7.1% on the VOC 2007 test set and by 6.0% on the MS-COCO 2017 validation set compared with the one-stage WSOD method.
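
The three steps named in the abstract (affinity-based CAM reassembly, multi-threshold box extraction, soft class labels) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `reassemble_cam`, `boxes_from_thresholds`, and `soft_label` are hypothetical, and the use of clipped cosine similarity as patch affinity, one tight box per threshold, and a product as the composite score are all assumptions chosen for brevity.

```python
import numpy as np


def reassemble_cam(cam, feats):
    """Spread coarse CAM scores over patches with similar features.

    cam:   (N,) activation per patch, values in [0, 1]
    feats: (N, D) patch features, e.g. from a self-supervised ViT
    """
    # Cosine similarity between every pair of patches.
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    affinity = np.clip(unit @ unit.T, 0.0, None)  # assumption: ignore negative similarity
    # Row-normalise so each patch receives a weighted average of CAM scores.
    affinity /= affinity.sum(axis=1, keepdims=True) + 1e-8
    out = affinity @ cam
    # Rescale to [0, 1] so the same thresholds apply to every image.
    return (out - out.min()) / (out.max() - out.min() + 1e-8)


def boxes_from_thresholds(cam2d, thresholds=(0.3, 0.5, 0.7)):
    """Return one tight box (x0, y0, x1, y1) per threshold, skipping empty masks."""
    boxes = []
    for t in thresholds:
        ys, xs = np.nonzero(cam2d >= t)
        if ys.size:
            boxes.append((int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())))
    return boxes


def soft_label(cls_conf, iou):
    """Hypothetical composite score: classification confidence times box overlap."""
    return cls_conf * iou


# Toy 4x4 patch grid: a 2x2 object in the centre, CAM firing on one patch only.
feats = np.tile([0.0, 1.0], (16, 1))      # background feature
feats[[5, 6, 9, 10]] = [1.0, 0.0]         # object feature
cam = np.zeros(16)
cam[5] = 1.0

before = boxes_from_thresholds(cam.reshape(4, 4), thresholds=(0.5,))
after = boxes_from_thresholds(reassemble_cam(cam, feats).reshape(4, 4), thresholds=(0.5,))
print(before, after)
```

In this toy case the raw CAM yields a box around a single patch, while the reassembled map covers all four object patches because they share the same feature. A real pipeline would likely need connected-component analysis per threshold and a learned or tuned scoring rule, neither of which is shown here.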