Multi-Part Token Transformer with Dual Contrastive Learning for Fine-grained Image Classification

Published: 01 Jan 2023, Last Modified: 05 Nov 2023, ACM Multimedia 2023
Abstract: Fine-grained image classification focuses on distinguishing objects from visually similar subcategories, which requires the classification model to extract subtle yet discriminative descriptors. The recent Vision Transformer (ViT) has shown enormous potential for this challenging task, but previous ViT-based methods have primarily focused on improving the relationships between image patches, neglecting the limited expressive capability caused by the single class token. To address this limitation, we propose to learn a Multi-part Token Transformer (MpT-Trans), which extends the class token to multiple tokens representing various parts, enhancing the model's capability to extract discriminative information. Specifically, our MpT-Trans model augments the vision transformer framework with two modules: (i) the Part-wise Shift Learning (PwSL) module extends the single class token to a set of part tokens with differentiable shifts, enabling the model to extract informative representations from different perspectives; (ii) the Dual Contrastive Learning (DuCL) module exploits the inter-class and inter-part relationships to regularize the learning of part tokens, enhancing their diversity and discrimination for accurate classification. Extensive experiments and ablation studies demonstrate that the proposed MpT-Trans achieves state-of-the-art performance on various fine-grained image classification benchmarks, confirming the effectiveness of our method.
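
The sketch below is a minimal, hypothetical illustration of the multi-part token idea summarized in the abstract: a set of learnable part tokens is prepended to the patch embeddings in place of the single class token, each part token feeds its own classifier head, and the per-part features remain available for a contrastive regularizer. All names (PartTokenViT, num_parts, part_tokens) and the use of a plain PyTorch transformer encoder are assumptions for illustration only; this is not the authors' PwSL or DuCL implementation.

# Hypothetical sketch of multiple part tokens in a ViT-style encoder.
# Not the authors' code; names and hyperparameters are illustrative.
import torch
import torch.nn as nn

class PartTokenViT(nn.Module):
    def __init__(self, embed_dim=768, num_parts=4, num_patches=196,
                 depth=2, num_heads=8, num_classes=200):
        super().__init__()
        # One learnable token per part instead of a single [CLS] token.
        self.part_tokens = nn.Parameter(torch.zeros(1, num_parts, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + num_parts, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Each part token gets its own classifier head; logits are averaged.
        self.heads = nn.ModuleList(
            nn.Linear(embed_dim, num_classes) for _ in range(num_parts)
        )

    def forward(self, patch_embeddings):          # (B, num_patches, embed_dim)
        b = patch_embeddings.size(0)
        parts = self.part_tokens.expand(b, -1, -1)
        x = torch.cat([parts, patch_embeddings], dim=1) + self.pos_embed
        x = self.encoder(x)
        part_feats = x[:, : parts.size(1)]        # (B, num_parts, embed_dim)
        logits = torch.stack(
            [head(part_feats[:, i]) for i, head in enumerate(self.heads)]
        ).mean(dim=0)
        # part_feats can be passed to an inter-class / inter-part contrastive loss.
        return logits, part_feats

# Usage with dummy patch embeddings:
model = PartTokenViT()
dummy = torch.randn(2, 196, 768)
logits, parts = model(dummy)
print(logits.shape, parts.shape)  # torch.Size([2, 200]) torch.Size([2, 4, 768])

In this sketch the part tokens play the role that the single class token plays in a standard ViT; how the shifts and the dual contrastive objective are actually defined is described in the paper itself.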