Token Masking Transformer for Weakly Supervised Object Localization

Published: 01 Jan 2025, Last Modified: 05 Nov 2025. IEEE Trans. Multim. 2025. License: CC BY-SA 4.0.
Abstract: Weakly supervised object localization (WSOL) is a promising yet challenging task that aims to localize objects using only image-level category labels for supervision. Vision transformers have recently been applied to WSOL with notable success by exploiting the long-range feature dependencies captured by self-attention. However, transformer-based approaches suffer from the same partial activation problem as CNN-based approaches because the self-attention maps are trained through the classification task: only a few discriminative regions receive high attention responses, so the localization map fails to cover the whole object. To alleviate this problem, we propose a plug-and-play Token Masking Transformer (TMT) that helps transformer-based WSOL methods obtain more complete localization maps via dynamic discriminative token masking. Specifically, a batch-wise discriminative token selection strategy is first introduced to flexibly determine the tokens to be masked in each image. Then, we design a token masking transformer block that performs the masking and encourages the network to mine more object-related tokens. In addition, we design an intermediate token activation loss that further improves TMT by imposing constraints on intermediate tokens. Extensive experiments demonstrate that TMT substantially improves existing transformer-based methods without increasing computational cost, and achieves state-of-the-art performance on two mainstream benchmarks.
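The core idea of discriminative token masking can be illustrated with a minimal sketch. This is not the paper's implementation; it simply shows one plausible reading of the mechanism: rank each image's patch tokens by their attention score, zero out the top-scoring (most discriminative) ones, and pass the masked tokens onward so later blocks must rely on less discriminative, object-related regions. The function name, the `mask_ratio` parameter, and the use of plain NumPy arrays are all illustrative assumptions.

```python
import numpy as np

def mask_discriminative_tokens(tokens, attn_scores, mask_ratio=0.1):
    """Zero out the highest-attention (most discriminative) tokens per image.

    tokens:      (B, N, D) patch-token embeddings
    attn_scores: (B, N) attention score per token (e.g., class-to-patch attention)
    mask_ratio:  fraction of tokens to mask in each image (hypothetical knob)
    """
    B, N, _ = tokens.shape
    k = max(1, int(N * mask_ratio))
    # Per-image indices of the k tokens with the highest attention scores.
    top_idx = np.argsort(attn_scores, axis=1)[:, -k:]
    masked = tokens.copy()
    batch_idx = np.arange(B)[:, None]          # broadcast over the k masked slots
    masked[batch_idx, top_idx] = 0.0           # suppress the discriminative tokens
    return masked, top_idx
```

In a training loop, the masked tokens would feed the subsequent transformer block, so the classification loss can only be satisfied by activating additional object regions, which is what enlarges the localization map.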