A Realistic Protocol for Evaluation of Weakly Supervised Object Localization

Published: 01 Jan 2025, Last Modified: 12 May 2025WACV 2025EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Weakly Supervised Object Localization (WSOL) allows training deep learning models for classification and localization (LOC) using only global class-level labels. The absence of bounding box (bbox) supervision during training raises challenges in the literature for hyper-parameter tuning, model selection, and evaluation. WSOL methods rely on a validation set with bbox annotations for model selection, and a test set with bbox annotations for threshold estimation for producing bboxes from localization maps. This approach, however, is not aligned with the WSOL setting as these annotations are typically unavailable in real-world scenarios. Our initial empirical analysis shows a significant decline in LOC performance when model selection and threshold estimation rely solely on class labels and the image itself, respectively, compared to using manual bbox annotations. This highlights the importance of incorporating bbox labels for optimal model performance. In this paper11Our scope. This work focuses on WSOL [2],[28]–[30], as opposed to other related tasks such as weakly supervised detection, segmentation, or instance segmentation, which are often mixed in earlier works [11], [18], [40]., a new WSOL evaluation protocol is proposed that provides LOC information without the need for manual bbox annotations. In particular, we generated noisy pseudo-boxes from a pretrained off-the-shelf region proposal method such as Selective Search, CLIP, and RPN for model selection. These bboxes are also employed to estimate the threshold from LOC maps, circumventing the need for test-set bbox annotations. Our experiments22Our code and generated pseudo-bounding boxes can be accessed at github.com/shakeebmurtaza/wsol_model_selection. with several WSOL methods on challenging natural and medical image datasets show that using the proposed pseudo-bboxes for validation facilitates the model selection and threshold estimation, with LOC performance comparable to models selected using GT bboxes on the validation set and threshold estimation on the test set. It also outperforms models selected using class-level labels, and then dynamically thresholded with only LOC maps.
Loading