Abstract: Composed image retrieval is a challenging task that aims to identify the matched image given a multi-modal query consisting of a reference image and a textual modifier. Most existing methods are devoted to composing unified query representations from the query images and texts, yet they neglect the distribution gap between the hybrid-modal query representations and the visual target representations. However, directly incorporating target features into the query may cause ambiguous rankings and poor robustness, owing to insufficient exploration of discriminative cues and to overfitting. To address these concerns, we propose a novel framework termed SemAntic Distillation from Neighborhood (SADN) for composed image retrieval. To mitigate the distribution divergence, we sample a neighborhood from the target domain for each query and aggregate the neighbor features with adaptive weights to restructure the query representation. Specifically, the adaptive weights are determined by the collaboration of two modules, namely correspondence-induced adaptation and divergence-based correction. Correspondence-induced adaptation captures correlation alignments among neighbor features under the guidance of the positive representation, while divergence-based correction regulates the weights according to the embedding distances between hard negatives and the query in the latent space. Extensive experiments and ablation studies on CIRR and FashionIQ validate that the proposed semantic distillation from neighborhood significantly outperforms baseline methods.
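The following is a minimal sketch, not the authors' implementation, of the adaptive-weighted neighborhood aggregation described in the abstract. The tensor names, the exact form of the two weighting terms, the temperature, and the final fusion step are all assumptions made purely for illustration.

```python
# Minimal sketch (not the authors' code) of adaptive-weighted neighborhood
# aggregation for a composed query embedding. The weighting terms and the
# final fusion are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn.functional as F

def aggregate_neighborhood(query, neighbors, positive, hard_negatives,
                           alpha=0.5, tau=0.07):
    """
    query:          (D,)   composed image+text query embedding
    neighbors:      (K, D) embeddings sampled from the target (image) domain
    positive:       (D,)   embedding of the ground-truth target image
    hard_negatives: (M, D) embeddings of hard negative target images
    Returns a restructured query embedding of shape (D,).
    """
    q = F.normalize(query, dim=-1)
    nbrs = F.normalize(neighbors, dim=-1)
    pos = F.normalize(positive, dim=-1)
    negs = F.normalize(hard_negatives, dim=-1)

    # Correspondence-induced term: neighbors aligned with the positive target
    # receive higher weights.
    corr = nbrs @ pos                                # (K,)

    # Divergence-based correction: down-weight neighbors that lie close to
    # hard negatives in the latent space.
    neg_sim = (nbrs @ negs.T).max(dim=-1).values     # (K,)
    div = -neg_sim

    # Adaptive weights from the collaboration of the two terms.
    weights = F.softmax((corr + div) / tau, dim=-1)  # (K,)
    neigh_feat = weights @ nbrs                      # (D,)

    # Fuse the aggregated neighborhood semantics back into the query
    # (a simple convex combination here, assumed for illustration).
    return F.normalize(alpha * q + (1 - alpha) * neigh_feat, dim=-1)
```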
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: In this work, we propose a novel framework, Semantic Distillation from Neighborhood (SADN), for composed image retrieval, a challenging vision-and-language task that measures the similarities between candidate images and hybrid-modal queries composed of reference images and modification captions. The task demands semantic comprehension of users' queries, given the raw reference images and modification texts, as well as alignment between the target images and the user requirements. The proposed work constructs a neighborhood for each hybrid-modal query and designs adaptive weights for the neighbor instances based on their semantic correlations. The enhanced neighborhood representations are further aggregated into the mixed-modal query features to mitigate the heterogeneous gap between the mixed-modal space and the visual space. By improving the recall rates of this multi-modal retrieval task with a concise structure, our work has the potential to enhance the usability of multimedia data across a range of real-life applications such as e-commerce platforms and vision-language reasoning research. These contributions focus on multimodal compositional learning, cross-modal retrieval, and multimodal semantic learning, which fits squarely within the scope of the ACM Multimedia conference.
Supplementary Material: zip
Submission Number: 4297