Abstract: Visual Instance Retrieval (VIR) is an active research topic because of its wide range of real-world applications, such as object re-identification in smart city scenarios. However, because the source training data offer limited style diversity, most existing VIR models fail to generalize well to unseen domains, and improving the generalizability of VIR models has received increasing attention in recent years. In this paper, we focus on the Single Domain Generalization (SDG) based VIR task, a more challenging but practical problem in which the model is trained on data from a single domain and evaluated directly on an unseen target domain without any fine-tuning or adaptation. In this setting, the limited style variance in the training data may cause the model to rely incorrectly on superficial style features, which reduces its generalizability. To address this issue, we propose a novel Style-Invariant robust Representation Learning (SIRL) method, which first diversifies the training data with style augmentation and then forces the model to learn style-invariant features. Specifically, we first design an adversarial style synthesis module that learns to synthesize diverse augmented samples with adversarially learned styles. We then devise an invariant feature learning module that minimizes the cross-domain feature inconsistency between source images and style-augmented images in order to capture domain-invariant instance features. In this way, we prevent the model from over-exploiting cues that are independent of semantic content (e.g., color) as shortcut features, so that it estimates pairwise instance similarity more robustly. We integrate our SIRL method with state-of-the-art VIR networks and evaluate its effectiveness on several public benchmark datasets. Extensive experiments show that the SIRL method substantially improves the generalizability of existing VIR networks in the challenging SDG-VIR setting.
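
The abstract does not give implementation details, so the following is only a minimal NumPy sketch of the two-stage idea (synthesize a hard style, then measure feature inconsistency), not the SIRL implementation. Several things here are assumptions made for illustration: style is modeled as per-channel mean and standard deviation of a feature map, the encoder `embed` is a hypothetical stand-in, and the adversarial step is replaced by a simple random search over candidate styles rather than a learned, gradient-based module. All function names are hypothetical.

```python
import numpy as np

def channel_stats(x, eps=1e-5):
    """Per-sample, per-channel mean and std over the spatial dims of (N, C, H, W)."""
    mu = x.mean(axis=(2, 3), keepdims=True)
    sigma = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
    return mu, sigma

def restyle(x, new_mu, new_sigma):
    """Replace the channel statistics of x with (new_mu, new_sigma), keeping content."""
    mu, sigma = channel_stats(x)
    return (x - mu) / sigma * new_sigma + new_mu

def embed(x, w):
    """Hypothetical stand-in encoder: ReLU, global average pool, linear map, L2-normalize."""
    pooled = np.maximum(x, 0.0).mean(axis=(2, 3))   # (N, C)
    z = pooled @ w                                   # (N, D)
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)

def inconsistency(x_src, x_aug, w):
    """Mean squared distance between source and style-augmented embeddings."""
    diff = embed(x_src, w) - embed(x_aug, w)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def hardest_style(x, w, rng, n_candidates=8, strength=0.5):
    """Assumed stand-in for adversarial style synthesis: sample candidate styles
    and keep the one that maximizes the feature inconsistency."""
    mu, sigma = channel_stats(x)
    best_loss, best_aug = -1.0, x
    for _ in range(n_candidates):
        new_mu = mu + strength * sigma * rng.standard_normal(mu.shape)
        new_sigma = sigma * np.exp(strength * rng.standard_normal(sigma.shape))
        x_aug = restyle(x, new_mu, new_sigma)
        loss = inconsistency(x, x_aug, w)
        if loss > best_loss:
            best_loss, best_aug = loss, x_aug
    return best_aug, best_loss

# Usage on random data (shapes are arbitrary illustration values).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 8, 8))
w = rng.standard_normal((3, 16))
x_aug, loss = hardest_style(x, w, rng)
```

In a training loop built on this sketch, the returned inconsistency would be minimized with respect to the encoder parameters, presumably alongside the usual retrieval loss, while the style search plays the role of the maximizing adversary.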