Abstract: Transformers have achieved success in many computer vision tasks, but their potential in Zero-Shot Learning (ZSL) has yet to be fully explored. In this paper, a Transformer architecture is developed, termed DSPformer, which discovers semantic parts through token growth and clustering. This is achieved via two proposed modules: Adaptive Token Growth (ATG) and Semantic Part Clustering (SPC). First, it is observed that background clutter may distract the model, causing it to rely on irrelevant regions when making decisions. To alleviate this issue, ATG is proposed to locate discriminative foreground regions and discard meaningless, even noisy, background. Second, semantically similar parts may be scattered across different tokens. To address this problem, SPC is proposed to group semantically consistent parts by token clustering. Extensive experiments on several challenging datasets demonstrate the effectiveness of the proposed DSPformer.
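The abstract does not specify how SPC's token clustering is implemented. As a hedged illustration only (not the authors' method), the general idea of grouping token embeddings into semantic parts can be sketched with a plain k-means assignment over a Transformer's patch-token embeddings; the function name and deterministic initialization below are assumptions for clarity:

```python
import numpy as np

def cluster_tokens(tokens, k, iters=10):
    """Illustrative sketch: group (N, D) token embeddings into k clusters
    with plain k-means, as a stand-in for semantic-part grouping.
    Returns an (N,) array of cluster labels."""
    # Initialize centroids from the first k tokens (deterministic; random
    # or kmeans++ initialization would be typical in practice).
    centroids = tokens[:k].copy()
    for _ in range(iters):
        # Assign each token to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(tokens[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned tokens.
        for c in range(k):
            members = tokens[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return labels
```

In a Transformer, `tokens` would be the patch embeddings from a late layer; tokens landing in the same cluster would then be pooled into one part-level representation.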