SwinTransFuse: Fusing Swin and multiscale transformers for fine-grained image recognition and retrieval

Xu Ouyang, Tao Zhou, Rene Vidal, Arnab Dhua

Published: 19 Jun 2022, Last Modified: 14 May 2024, CVPR workshop 2022, CC BY 4.0

Abstract: Fine-grained recognition and retrieval are challenging tasks in computer vision because images from different subclasses are highly similar. Recent work on fine-grained image recognition has achieved significant improvements by using the attention mechanisms of the Vision Transformer or the Swin Transformer to find discriminative image regions at coarse or fine scales, respectively. Here, we propose SwinTransFuse, a novel architecture for fine-grained recognition that fuses a Swin Transformer and a Multiscale Vision Transformer to capture both local and global features. We also propose Swin Transformer and SwinTransFuse siamese networks for fine-grained image retrieval. Our methods reach state-of-the-art performance on the CUB-200-2011 and Stanford Online Products fine-grained datasets.
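
The two ideas in the abstract, fusing a local and a global feature vector and ranking images by embedding similarity with a shared encoder, can be illustrated with a toy sketch. This is not the authors' method: the abstract does not say how the two transformers are fused, so concatenation followed by L2 normalization is an assumption made purely for illustration. The function names (`fuse`, `retrieve`) and the short lists standing in for backbone outputs are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(local_feat, global_feat):
    """ASSUMPTION: fuse by concatenating the two normalized feature vectors.

    The source does not specify the fusion operator; concatenation is a
    common stand-in, used here only to make the idea concrete.
    """
    return l2_normalize(l2_normalize(local_feat) + l2_normalize(global_feat))


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_emb, gallery, top_k=1):
    """Rank gallery items by similarity to the query embedding.

    In a siamese setup the same encoder (shared weights) embeds both the
    query and every gallery image; here the embeddings are given directly.
    """
    ranked = sorted(gallery, key=lambda item: cosine(query_emb, item[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Toy vectors standing in for (hypothetical) local and global backbone outputs.
query = fuse([1.0, 0.0], [0.0, 1.0])
gallery = [
    ("a", fuse([1.0, 0.1], [0.0, 1.0])),  # close to the query
    ("b", fuse([0.0, 1.0], [1.0, 0.0])),  # orthogonal to the query
]
best_match = retrieve(query, gallery, top_k=1)
```

With these toy inputs, item "a" ranks first because its fused embedding is nearly parallel to the query's, while "b" is orthogonal to it.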