Training-Free Pseudo-Fusion Strategies for Composed Image Retrieval via Diffusion and Multimodal Large Language Models
Keywords: Composed Image Retrieval, Modality Conversion, Modality Fusion, Generative Models, Training-free Methods
Abstract: Composed Image Retrieval (CIR) is an emerging paradigm in content-based image retrieval that enables users to formulate complex visual queries by combining a reference image with an auxiliary modality, typically text. This approach supports fine-grained search in which the target image shares structural elements with the reference image but is modified according to the auxiliary text. Conventional CIR methods rely on multimodal fusion to combine visual and textual features into a joint query embedding. In this work, we propose PEFUSE (short for pseudo-fusion), a training-free framework that leverages pre-trained models to bridge modalities via generative conversion. We introduce two novel strategies, uni-directional and bi-directional conversion, both implemented using diffusion models and multimodal large language models. These strategies reformulate CIR as either intra-modal or cross-modal retrieval, bypassing the need for dedicated training.
Extensive experiments on standard benchmarks show that our approach achieves competitive or superior performance compared to state-of-the-art methods, highlighting the efficacy and flexibility of our pseudo-fusion paradigm for composed retrieval.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 21815