Keywords: Composed image retrieval; Multimodal fusion; Multimodal retrieval
Abstract: The Composed Image Retrieval (CIR) task aims to retrieve target images based on a reference image and a modification text. Current CIR methods primarily rely on fine-tuning vision-language pre-trained models. However, due to the large scale of pre-trained models and the limited training data available for CIR, significant overfitting commonly occurs during fine-tuning, resulting in poor generalization. To address this issue, we propose a novel Weight-Regularized Fine-tuning network for CIR, termed WRF4CIR. Specifically, during fine-tuning, we introduce adversarial perturbations to the model weights for regularization, where these perturbations are generated in the direction opposite to gradient descent. Intuitively, WRF4CIR increases the model's learning difficulty on the training data, effectively mitigating overfitting. Technically, WRF4CIR explicitly regularizes the flatness of the weight loss landscape, enhancing the model's robustness to weight perturbations and improving generalization. Extensive experiments on benchmark datasets demonstrate that WRF4CIR significantly narrows the generalization gap and achieves substantial improvements over existing methods.
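To make the weight-perturbation idea in the abstract concrete, below is a minimal PyTorch sketch of one training step that perturbs the weights along the adversarial (gradient-ascent) direction before updating them, in the spirit of flatness-aware regularization. The perturbation radius `rho`, the helper name `perturbed_step`, and the normalization scheme are illustrative assumptions, not the authors' exact WRF4CIR formulation.

```python
# Illustrative sketch (not the official WRF4CIR code): a fine-tuning step that
# adds an adversarial perturbation to the weights before computing the update,
# encouraging a flatter weight loss landscape.
import torch

def perturbed_step(model, loss_fn, batch, optimizer, rho=0.05):
    # 1) Compute gradients at the current weights.
    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()

    # 2) Perturb each weight along the normalized gradient-ascent direction,
    #    i.e., opposite to the usual descent step (assumed radius rho).
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2) + 1e-12
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / grad_norm
            p.add_(e)          # w <- w + eps (adversarial weight perturbation)
            eps.append(e)

    # 3) Compute gradients at the perturbed weights, then restore the original
    #    weights and apply the optimizer update with those gradients.
    optimizer.zero_grad()
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)      # restore original weights
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

In this sketch, gradients evaluated at the perturbed weights drive the update of the unperturbed weights, so the model is penalized when a small weight perturbation sharply increases the loss, which is one way to realize the flatness regularization the abstract describes.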
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15347