MLNet: Mutual Learning Network to Improve Self-Supervised Representation for Fine-Grained Visual Recognition

Published: 2025, Last Modified: 06 Jan 2026 · ICASSP 2025 · CC BY-SA 4.0
Abstract: High-quality annotation for fine-grained visual categorization requires extensive professional knowledge and is therefore time-consuming and laborious. Learning fine-grained visual representations from large numbers of unlabeled images through self-supervised learning has thus recently become a popular alternative. However, existing self-supervised learning methods are not effective for fine-grained visual categorization, since many features that help optimize self-supervised objectives are unsuited to characterizing the subtle differences that matter in fine-grained visual recognition. To address this issue, we propose a mutual learning network that enhances the model's attention to discriminative semantic features. The key idea is to enforce semantic consistency between different augmented views of the same image and to capture discriminative semantic information. For semantic consistency, we demonstrate that a cross-view attention module between different augmented views can guide our model to capture similar semantic features. Building on this, we further introduce a GradCAM-guided multi-dimension loss that uses GradCAM to steer our model, along different dimensions, toward discriminative semantic information beneficial to fine-grained visual recognition. Experiments on the CUB-200-2011, Stanford Cars, and Aircraft datasets demonstrate that the mutual learning network outperforms previous self-supervised learning methods in linear probing and image retrieval.
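The cross-view attention idea described above can be sketched roughly as follows: queries from one augmented view attend to keys and values from the other view, so each view's output emphasizes features shared between the two views. This is a minimal illustrative PyTorch sketch, not the paper's implementation; the module layout, feature dimension, and head count are assumptions.

```python
import torch
import torch.nn as nn


class CrossViewAttention(nn.Module):
    """Illustrative cross-view attention between two augmented views.

    Queries come from one view, keys/values from the other, which
    encourages the model to focus on semantics common to both views.
    Hyperparameters here are assumptions, not the paper's settings.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # feat_a, feat_b: (batch, tokens, dim) token sequences from two views,
        # e.g. a flattened 7x7 backbone feature map.
        out_a, _ = self.attn(feat_a, feat_b, feat_b)  # view A attends to view B
        out_b, _ = self.attn(feat_b, feat_a, feat_a)  # view B attends to view A
        return out_a, out_b


# Usage: two views of the same image, 49 tokens of dimension 64 each.
x1 = torch.randn(2, 49, 64)
x2 = torch.randn(2, 49, 64)
module = CrossViewAttention(dim=64)
y1, y2 = module(x1, x2)
print(y1.shape, y2.shape)  # same shapes as the inputs
```

Sharing one attention module for both directions keeps the sketch symmetric; an actual implementation could just as well use two separate modules or add residual connections and normalization around the attention.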