STT: Towards Training-Free Zero-Shot Composed Image Retrieval via Semantic Transition and Transportation

ICLR 2026 Conference Submission 15292 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Zero-shot Composed Image Retrieval; MLLMs; Conditional transport; Bayesian methods
TL;DR: STT is a training-free zero-shot composed image retrieval framework that refines LLM-generated target captions via a semantic transition in the embedding space and retrieves images with a bidirectional transportation distance.
Abstract: Training-free zero-shot composed image retrieval (ZS-CIR) models have recently gained increasing research interest due to their generalizability and flexibility in unseen multimodal retrieval. Recent LLM-based advances focus on generating the expected target caption by exploiting the compositional ability of LLMs. Although efficient, these approaches face two challenges: 1) the generated captions tend to introduce unexpected features from the reference image due to the semantic gap between the input image and the text modification, since the image contains far more detail than the text; 2) the point-to-point alignment during the retrieval stage fails to capture diverse compositions. To address these challenges, this paper introduces a novel Semantic Transition and Transportation (STT) framework for training-free ZS-CIR. Specifically, given the composed caption inferred by an LLM, we refine it through a transition vector in the embedding space to bring it closer to the target image. Guided by the LLM and the user instruction, the refined caption concentrates on the core modification intent and thus filters out unnecessary noise. Moreover, to explore diverse alignment during the retrieval stage, we model the caption and the image as discrete distributions and reformulate retrieval as a set-to-set alignment task. Finally, a bidirectional transportation distance is developed to capture fine-grained cross-modal alignments and compute the retrieval score. Extensive experiments and ablations demonstrate that our method is general, effective, and beneficial across many CIR tasks.
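The set-to-set alignment described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation; it assumes uniform mass over caption-token and image-patch embeddings and uses a softmax-based, conditional-transport-style plan in each direction, averaging the two directional scores. All names (`bidirectional_transport_score`, `temperature`) and the specific plan construction are illustrative assumptions.

```python
import numpy as np

def bidirectional_transport_score(text_emb, image_emb, temperature=0.1):
    """Hypothetical set-to-set retrieval score between a caption and an image.

    text_emb:  (m, d) array of caption token/segment embeddings
    image_emb: (n, d) array of image patch/region embeddings
    """
    # L2-normalize so inner products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sim = t @ v.T  # (m, n) pairwise similarity matrix

    # Text -> image direction: each text element spreads its mass over
    # image elements via a softmax over similarities, then the expected
    # similarity under that plan is averaged over text elements.
    plan_t2v = np.exp(sim / temperature)
    plan_t2v /= plan_t2v.sum(axis=1, keepdims=True)
    score_t2v = (plan_t2v * sim).sum(axis=1).mean()

    # Image -> text direction: the symmetric construction.
    plan_v2t = np.exp(sim.T / temperature)
    plan_v2t /= plan_v2t.sum(axis=1, keepdims=True)
    score_v2t = (plan_v2t * sim.T).sum(axis=1).mean()

    # Bidirectional score: average of the two directional scores.
    return 0.5 * (score_t2v + score_v2t)
```

Under these assumptions, ranking candidate images by this score replaces single-vector (point-to-point) cosine similarity with a fine-grained, bidirectional matching between the two element sets.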
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15292