AdaST: Adaptive Semantic Transformation of Visual Representation for Training-free Zero-shot Composed Image Retrieval
Keywords: Zero-shot Composed Image Retrieval, Training-free, Multi-modal, VLM, LLM
TL;DR: We propose an efficient and effective feature-level transformation method for Zero-shot Composed Image Retrieval.
Abstract: Composed Image Retrieval (CIR) aims to retrieve a target image given a reference image and a textual modification instruction. The instruction specifies the desired change, while the remaining visual attributes of the reference image should be preserved. Recent research has focused on training-free methods that leverage image generation models to synthesize proxy images by combining a reference image with a textual modification. However, this approach is computationally expensive and time-consuming, while relying solely on text queries often results in the loss of crucial visual details. To address these issues, we propose Adaptive Semantic Transformation (AdaST), a new training-free method that transforms reference image features into proxy features guided by text. It preserves visual information more efficiently without relying on image generation. To achieve finer-grained transformation, we introduce an adaptive weighting mechanism that balances proxy and text features, enabling the model to exploit proxy information only when it is reliable. Our method is lightweight and can be seamlessly applied to existing training-free baselines in a plug-and-play manner. Extensive experiments demonstrate that it achieves state-of-the-art performance on three CIR benchmarks while avoiding the heavy cost of image generation and incurring only marginal inference overhead compared to text-based baselines.
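The adaptive weighting idea described above can be illustrated with a minimal sketch. This is not the paper's actual formulation: the reliability score, the cosine-based weight, and the function name `adaptive_fuse` are all hypothetical choices made for illustration, assuming CLIP-style L2-normalized embeddings where retrieval is by cosine similarity.

```python
import numpy as np

def adaptive_fuse(proxy_feat: np.ndarray, text_feat: np.ndarray) -> np.ndarray:
    """Hypothetical sketch of adaptively weighting proxy vs. text features.

    The proxy feature is trusted in proportion to its agreement with the
    text feature, so proxy information dominates only when it looks
    reliable. Illustrative only; not AdaST's actual formula.
    """
    # L2-normalize both features, as in CLIP-like embedding spaces
    p = proxy_feat / np.linalg.norm(proxy_feat)
    t = text_feat / np.linalg.norm(text_feat)
    # Assumed reliability score: cosine similarity mapped to [0, 1]
    w = (float(p @ t) + 1.0) / 2.0
    # Convex combination of proxy and text, renormalized for retrieval
    q = w * p + (1.0 - w) * t
    return q / np.linalg.norm(q)
```

A query embedding built this way can be compared against gallery image embeddings by cosine similarity, so the fusion adds only a few vector operations on top of a text-based baseline.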
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10606