Training-Free Pseudo-Fusion Strategies for Composed Image Retrieval via Diffusion and Multimodal Large Language Models

TMLR Paper 7789 Authors

05 Mar 2026 (modified: 12 Mar 2026) · Under review for TMLR · CC BY 4.0
Abstract: Composed Image Retrieval (CIR) is an emerging paradigm in Content-based Image Retrieval that lets users formulate compositional queries by combining a reference image with an auxiliary modality, usually text. This supports fine-grained search in which the target image shares structural elements with the reference image but is modified according to the auxiliary text. Conventional CIR methods rely on multimodal fusion to combine visual and textual features into a joint query embedding, which requires training modules that align composed queries with their targets. In this work, we propose PEFUSE (for pseudo-fusion), a training-free framework that leverages pretrained models to bridge modalities via generative conversion. We introduce two novel strategies, uni-directional and bi-directional conversion, both implemented with diffusion models and multimodal large language models; together they reformulate CIR as four single-modality (intra-modal or cross-modal) single-query retrieval tasks, bypassing the need for dedicated training. Extensive experiments on standard benchmarks show that converting CIR into text-to-image retrieval outperforms the alternative conversion strategies, achieving competitive or superior performance compared with state-of-the-art methods while maintaining strong time efficiency. These results highlight the effectiveness of the pseudo-fusion paradigm for composed retrieval. Our code is available at: https://anonymous.4open.science/r/ComposedImageRetrieval-9241.
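As a rough illustration of the pseudo-fusion idea described in the abstract (not the authors' implementation), one uni-directional conversion turns CIR into plain text-to-image retrieval: a multimodal LLM captions the reference image, the caption is merged with the modification text into a single textual query, and a standard encoder ranks the gallery. In the sketch below, `caption_image`, `merge_texts`, and `embed_text` are hypothetical placeholders; in practice they would wrap a pretrained MLLM and a vision-language encoder such as CLIP, and the gallery would hold image embeddings rather than text.

```python
import math

# Hypothetical placeholder: in practice, a pretrained MLLM would generate
# this caption from the reference image pixels.
def caption_image(image_id: str) -> str:
    captions = {"ref.jpg": "a red dress with short sleeves"}
    return captions[image_id]

def merge_texts(caption: str, modification: str) -> str:
    # Simplest possible merge; a conversion step could instead ask an LLM
    # to rewrite the pair into one coherent target description.
    return f"{caption}, {modification}"

# Toy bag-of-words embedding standing in for a CLIP text/image encoder.
def embed_text(text: str) -> dict:
    vec = {}
    for tok in text.lower().replace(",", " ").split():
        vec[tok] = vec.get(tok, 0.0) + 1.0
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a.get(k, 0.0) * v for k, v in b.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(reference: str, modification: str, gallery: dict) -> list:
    """Rank gallery items (id -> stand-in description) for a composed query."""
    query = embed_text(merge_texts(caption_image(reference), modification))
    scores = {gid: cosine(query, embed_text(desc)) for gid, desc in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)

gallery = {
    "a.jpg": "a blue dress with short sleeves",
    "b.jpg": "a red dress with long sleeves",
    "c.jpg": "a green jacket",
}
ranking = retrieve("ref.jpg", "make the sleeves long", gallery)
```

The training-free property comes from the fact that every component (captioner, merger, encoder) is a frozen pretrained model; only the conversion direction changes across the paper's four retrieval formulations.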
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Xiao_Luo3
Submission Number: 7789