AdaST: Adaptive Semantic Transformation of Visual Representation for Training-free Zero-shot Composed Image Retrieval
Keywords: Zero-shot Composed Image Retrieval, Training-free, Multi-modal, VLM, LLM
TL;DR: We propose an efficient and effective feature-level transformation method for Zero-shot Composed Image Retrieval.
Abstract: Composed Image Retrieval (CIR) aims to retrieve a target image given a reference image and a textual modification instruction. The instruction specifies the desired change, while the remaining visual attributes of the reference image should be preserved. Recent research has focused on training-free methods that leverage image generation models to synthesize proxy images by combining a reference image with a textual modification. However, this approach is computationally expensive and time-consuming, while relying solely on text queries often results in the loss of crucial visual details. To address these issues, we propose Adaptive Semantic Transformation (AdaST), a new training-free method that transforms reference image features into proxy features guided by text. It preserves visual information more efficiently without relying on image generation. To achieve finer-grained transformation, we introduce an adaptive weighting mechanism that balances proxy and text features, enabling the model to exploit proxy information only when it is reliable. Our method is lightweight and can be seamlessly applied to existing training-free baselines in a plug-and-play manner. Extensive experiments demonstrate that it achieves state-of-the-art performance on three CIR benchmarks while avoiding the heavy cost of image generation and incurring only marginal inference overhead compared to text-based baselines.
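The adaptive weighting idea described above can be illustrated with a minimal sketch. This is not the paper's actual formulation: the reliability score, the cosine-based weight, and the function name `adaptive_fuse` are all hypothetical choices made for illustration, assuming CLIP-style L2-normalized embeddings where retrieval is by cosine similarity.

```python
import numpy as np

def adaptive_fuse(proxy_feat: np.ndarray, text_feat: np.ndarray) -> np.ndarray:
    """Hypothetical sketch of adaptively weighting proxy vs. text features.

    The proxy feature is trusted in proportion to its agreement with the
    text feature, so proxy information dominates only when it looks
    reliable. Illustrative only; not AdaST's actual formula.
    """
    # L2-normalize both features, as in CLIP-like embedding spaces
    p = proxy_feat / np.linalg.norm(proxy_feat)
    t = text_feat / np.linalg.norm(text_feat)
    # Assumed reliability score: cosine similarity mapped to [0, 1]
    w = (float(p @ t) + 1.0) / 2.0
    # Convex combination of proxy and text, renormalized for retrieval
    q = w * p + (1.0 - w) * t
    return q / np.linalg.norm(q)
```

A query embedding built this way can be compared against gallery image embeddings by cosine similarity, so the fusion adds only a few vector operations on top of a text-based baseline.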
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10606