Keywords: Vision and Language, Training-free Composed Image Retrieval, Multimodal Large Language Model, Multimodal Tool-Using Agent
Abstract: Composed Image Retrieval (CIR) retrieves a target image that preserves the reference image's content while applying user-specified textual edits. Training-free zero-shot CIR (ZS-CIR) has progressed by casting the task as text-to-image retrieval with pretrained vision–language models, prompting multimodal LLMs to produce target captions. However, these approaches are hindered by frozen priors and a mismatch between free-form text and the retriever's embedding space. In this work, we introduce TaCIR, a training-free, tool-augmented agent for ZS-CIR that jointly reasons over the reference image and manipulation text, optionally consults external tools, and instantiates the inferred edit as a visual proxy. This proxy grounds implicit intent and reduces text-based retrieval misalignment by also enabling image-to-image comparisons in the retriever. A single, tool-aware, chain-of-thought prompt emits both an initial target description and an executable tool call; when a tool is invoked, the synthesized evidence is fed back to refine the description and guide retrieval. TaCIR requires no task-specific training and remains inference-efficient. Across four benchmarks and three CLIP backbones, TaCIR yields consistent improvements over strong training-free baselines, with average gains of 2.20% to 4.16%, establishing a new state of the art for training-free ZS-CIR while providing interpretable intermediate visualizations.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11896
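The abstract outlines a loop in which a single tool-aware chain-of-thought prompt yields a target description plus an optional tool call, the tool's output (a visual proxy) refines that description, and retrieval fuses text-to-image with image-to-image similarity. The sketch below illustrates this flow only at a schematic level; every function name, signature, stub body, and the fusion weight `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a TaCIR-style loop: prompt an MLLM for a target
# description and an optional tool call, optionally synthesize a visual proxy,
# then score the gallery with a mix of text-to-image and image-to-image similarity.
# All components here are stubs standing in for the real MLLM, tool, and CLIP encoders.
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class AgentStep:
    description: str           # initial or refined target caption
    tool_call: Optional[str]   # e.g. an image-editing instruction, or None


def query_mllm(image: np.ndarray, edit_text: str) -> AgentStep:
    """Stub for the multimodal LLM prompted with a tool-aware CoT template."""
    return AgentStep(description=f"target image: {edit_text}", tool_call=None)


def run_tool(reference_image: np.ndarray, tool_call: str) -> np.ndarray:
    """Stub tool that would synthesize a visual proxy realizing the inferred edit."""
    return reference_image  # placeholder proxy


def encode_text(text: str) -> np.ndarray:
    """Stub CLIP-style text encoder returning a unit-norm embedding."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)


def encode_image(image: np.ndarray) -> np.ndarray:
    """Stub CLIP-style image encoder returning a unit-norm embedding."""
    rng = np.random.default_rng(int(image.sum() * 1000) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)


def tacir_scores(reference_image: np.ndarray, edit_text: str,
                 gallery: List[np.ndarray], alpha: float = 0.5) -> np.ndarray:
    """Score gallery images by fusing caption-based and proxy-based similarity."""
    step = query_mllm(reference_image, edit_text)
    proxy = None
    if step.tool_call is not None:
        proxy = run_tool(reference_image, step.tool_call)
        step = query_mllm(proxy, edit_text)  # refine description with tool evidence

    text_emb = encode_text(step.description)
    gallery_embs = np.stack([encode_image(img) for img in gallery])
    scores = gallery_embs @ text_emb                      # text-to-image retrieval
    if proxy is not None:                                 # proxy enables image-to-image
        scores = alpha * scores + (1 - alpha) * (gallery_embs @ encode_image(proxy))
    return scores


if __name__ == "__main__":
    ref = np.zeros((224, 224, 3))
    gallery = [np.full((224, 224, 3), v) for v in (0.1, 0.5, 0.9)]
    print(tacir_scores(ref, "make the dress red", gallery))
```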