Dual-Combiner Network with Multi-Attention for Composed Image Retrieval

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Composed Image Retrieval; Multi-modal Fusion; Image Retrieval
TL;DR: We propose a Dual-Combiner with Multi-attention network that significantly improves composed image retrieval.
Abstract: Composed Image Retrieval (CIR) is a challenging task that aims to retrieve target images based on multimodal queries consisting of a reference image and modifying text. Due to the semantic and modal gaps between images and text, existing CIR methods struggle to accurately compose the reference image and the modifying text. Although some of these methods can establish fine-grained correspondences between local text tokens and visual regions, they often focus on the text-specified content in the reference image, overlooking the consistency of unmentioned regions with the target image. To address this limitation, we propose a novel Dual-Combiner with Multi-attention (DCMA) network that integrates self-attention, cross-attention, and channel-attention mechanisms to effectively capture the user's query intent. Specifically, the Global Combiner leverages the multi-attention framework to capture the global context of the query intent. In parallel, the Local Combiner preserves the fine-grained information of the reference image in the fused representations, maintaining consistency between the reference and target images. The proposed DCMA thus precisely encodes multi-granularity, multi-modal query information into the fused representations through the multi-attention framework. Extensive experiments demonstrate that DCMA achieves new state-of-the-art results across multiple benchmark datasets, validating its effectiveness in capturing complex multi-modal interactions for composed image retrieval. The source code for this work will be made available later.
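To make the dual-combiner idea in the abstract concrete, the sketch below illustrates one plausible reading: a Global Combiner in which the reference-image feature cross-attends over text tokens to capture global query intent, and a Local Combiner that gates the fused feature with the original image feature to preserve reference-image detail. All function names, dimensions, and the gating scheme are illustrative assumptions, not the authors' actual DCMA implementation.

```python
# Hedged pure-Python sketch of a dual-combiner fusion for CIR.
# Assumed design: single-head cross-attention for the Global Combiner,
# a gated residual for the Local Combiner. Not the paper's architecture.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(query, keys, values):
    """Single-head cross-attention: one query vector attends over keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Weighted sum of value vectors, one output per feature dimension.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def global_combiner(img_feat, txt_tokens):
    """Global context: image feature attends over all text tokens."""
    return cross_attention(img_feat, txt_tokens, txt_tokens)

def local_combiner(img_feat, fused, gate=0.5):
    """Preserve reference-image detail via a gated residual (assumed form)."""
    return [gate * i + (1 - gate) * f for i, f in zip(img_feat, fused)]

# Toy 4-d features standing in for encoder outputs.
img = [0.2, 0.1, 0.4, 0.3]
tokens = [[0.5, 0.0, 0.1, 0.2], [0.1, 0.3, 0.2, 0.0]]
fused_global = global_combiner(img, tokens)
fused_query = local_combiner(img, fused_global)
print(len(fused_query))  # 4-d fused query embedding for retrieval
```

In a real CIR pipeline, `fused_query` would be compared against target-image embeddings (e.g. by cosine similarity) to rank candidates.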
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11135