Keywords: Vision and Language, Training-free Composed Image Retrieval, Multimodal Large Language Model, Multimodal Tool-Using Agent
Abstract: Composed Image Retrieval (CIR) retrieves a target image that preserves the reference image's content while applying user-specified textual edits. Training-free zero-shot CIR (ZS-CIR) has progressed by casting the task as text-to-image retrieval with pretrained vision–language models, prompting multimodal LLMs to produce target captions. However, these approaches are hindered by frozen priors and a mismatch between free-form text and the retriever's embedding space. In this work, we introduce TaCIR, a training-free, tool-augmented agent for ZS-CIR that jointly reasons over the reference image and manipulation text, optionally consults external tools, and instantiates the inferred edit as a visual proxy. This proxy grounds implicit intent and reduces text-based retrieval misalignment by also enabling image-to-image comparisons in the retriever. A single, tool-aware, chain-of-thought prompt emits both an initial target description and an executable tool call; when a tool is invoked, the synthesized evidence is fed back to refine the description and guide retrieval. TaCIR requires no task-specific training and remains inference-efficient. Across four benchmarks and three CLIP backbones, TaCIR yields consistent improvements over strong training-free baselines, with average gains of 2.20% to 4.16%, establishing a new state of the art for training-free ZS-CIR while providing interpretable intermediate visualizations.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11896
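The abstract outlines a loop in which a single tool-aware chain-of-thought prompt yields a target description plus an optional tool call, the tool's output (a visual proxy) refines that description, and retrieval fuses text-to-image with image-to-image similarity. The sketch below illustrates this flow only at a schematic level; every function name, signature, stub body, and the fusion weight `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a TaCIR-style loop: prompt an MLLM for a target
# description and an optional tool call, optionally synthesize a visual proxy,
# then score the gallery with a mix of text-to-image and image-to-image similarity.
# All components here are stubs standing in for the real MLLM, tool, and CLIP encoders.
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class AgentStep:
    description: str           # initial or refined target caption
    tool_call: Optional[str]   # e.g. an image-editing instruction, or None


def query_mllm(image: np.ndarray, edit_text: str) -> AgentStep:
    """Stub for the multimodal LLM prompted with a tool-aware CoT template."""
    return AgentStep(description=f"target image: {edit_text}", tool_call=None)


def run_tool(reference_image: np.ndarray, tool_call: str) -> np.ndarray:
    """Stub tool that would synthesize a visual proxy realizing the inferred edit."""
    return reference_image  # placeholder proxy


def encode_text(text: str) -> np.ndarray:
    """Stub CLIP-style text encoder returning a unit-norm embedding."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)


def encode_image(image: np.ndarray) -> np.ndarray:
    """Stub CLIP-style image encoder returning a unit-norm embedding."""
    rng = np.random.default_rng(int(image.sum() * 1000) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)


def tacir_scores(reference_image: np.ndarray, edit_text: str,
                 gallery: List[np.ndarray], alpha: float = 0.5) -> np.ndarray:
    """Score gallery images by fusing caption-based and proxy-based similarity."""
    step = query_mllm(reference_image, edit_text)
    proxy = None
    if step.tool_call is not None:
        proxy = run_tool(reference_image, step.tool_call)
        step = query_mllm(proxy, edit_text)  # refine description with tool evidence

    text_emb = encode_text(step.description)
    gallery_embs = np.stack([encode_image(img) for img in gallery])
    scores = gallery_embs @ text_emb                      # text-to-image retrieval
    if proxy is not None:                                 # proxy enables image-to-image
        scores = alpha * scores + (1 - alpha) * (gallery_embs @ encode_image(proxy))
    return scores


if __name__ == "__main__":
    ref = np.zeros((224, 224, 3))
    gallery = [np.full((224, 224, 3), v) for v in (0.1, 0.5, 0.9)]
    print(tacir_scores(ref, "make the dress red", gallery))
```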