Fine-Grained Classification: Connecting Metadata via Cross-Contrastive Pre-Training

Sumit Mamtani, Yash Thesia

Published: 10 Nov 2025, Last Modified: 07 Nov 2025 | IEEE ISCMI | Everyone | CC BY-NC 4.0

Abstract: Fine-grained visual classification aims to recognize objects belonging to many subordinate categories of a supercategory, where appearance alone often fails to distinguish highly similar classes. We propose a unified framework that integrates image, text, and metadata via cross-contrastive pre-training. We first align the three modality encoders in a shared embedding space and then fine-tune the image and metadata encoders for classification. On NABirds, our approach improves over the baseline by 7.83% and achieves 84.44% top-1 accuracy, outperforming strong multimodal methods.
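The alignment stage described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact loss, so the code below assumes a CLIP-style symmetric InfoNCE objective applied to each of the three modality pairs (image-text, image-metadata, text-metadata) and summed with equal weight. The function names, the equal weighting, and the temperature of 0.07 are illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(a, b, temperature=0.07):
    """One-directional InfoNCE: row i of `a` should match row i of `b`.

    `a` and `b` are equal-length lists of embedding vectors; every other
    row in the batch acts as a negative.
    """
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of sample i against every candidate in b.
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def symmetric_contrastive(a, b, temperature=0.07):
    """Average of the a-to-b and b-to-a InfoNCE terms."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def cross_contrastive_loss(image, text, meta, temperature=0.07):
    """Assumed objective: sum of symmetric losses over all three pairs."""
    return (
        symmetric_contrastive(image, text, temperature)
        + symmetric_contrastive(image, meta, temperature)
        + symmetric_contrastive(text, meta, temperature)
    )


# Toy batch of two samples with 2-d embeddings (hypothetical values).
image = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.9, 0.1], [0.1, 0.9]]
meta_aligned = [[1.0, 0.2], [0.2, 1.0]]
meta_shuffled = [[0.2, 1.0], [1.0, 0.2]]

loss_aligned = cross_contrastive_loss(image, text, meta_aligned)
loss_shuffled = cross_contrastive_loss(image, text, meta_shuffled)
```

In this toy batch, `loss_aligned` is lower than `loss_shuffled`, because the shuffled metadata embeddings point toward the wrong sample. In a real pipeline the vectors would come from trainable image, text, and metadata encoders, and the loss would be minimized with an autodiff framework before the classification fine-tuning stage.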