Text-Oriented Image Query Representation for Zero-Shot Composed Image Retrieval

Pavan K. Rachabathuni, Andrea Ciamarra, Roberto Caldelli, Marco Bertini

Published: 2025 · Last Modified: 04 Mar 2026 · CBMI 2025 · CC BY-SA 4.0
Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) is the task of retrieving a target image based on a query that combines a reference image with a textual description specifying desired modifications, in a zero-shot setting. Existing ZS-CIR models typically fuse the visual and textual modalities into a single query representation, but often struggle to capture the fine-grained distinctions essential for accurate retrieval. In this paper, we present TEOZCIR, a transformer-based model that introduces a balanced semantic fusion module and an enhancement mechanism to more effectively integrate multimodal information. The model is built around two core components: the Text-Aware Query Combiner (TAQC) and the Query Enhancer Network (QENet). These components operate in tandem: TAQC dynamically adjusts the semantic contribution of the visual context based on the input text, generating a balanced query representation. This representation is then further refined by QENet, which enhances the fused features to better align with the target image. Throughout the entire process, the model maintains a lightweight architecture with significantly fewer trainable parameters than conventional training-based methods. Experiments carried out on three benchmark datasets (CIRR, Fashion IQ, and CIRCO) demonstrate that TEOZCIR significantly improves ZS-CIR performance, setting a new benchmark for multimodal retrieval.
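To make the fusion idea concrete, the following is a minimal sketch of a TAQC-style text-conditioned gated fusion, assuming CLIP-style embedding vectors for the reference image and the modification text. The function name `text_aware_fusion`, the scalar gate, and the gate weights `w_gate` are illustrative assumptions; the abstract does not specify the actual TAQC architecture.

```python
import numpy as np

def text_aware_fusion(img_feat, txt_feat, w_gate):
    """Hypothetical sketch of text-aware query combination (not the
    paper's actual TAQC): a gate derived from the text embedding
    weights how much the visual context contributes to the query."""
    # Scalar gate in (0, 1), conditioned on the text feature.
    gate = 1.0 / (1.0 + np.exp(-float(txt_feat @ w_gate)))
    # Balanced query: text-conditioned mix of textual and visual features.
    fused = gate * txt_feat + (1.0 - gate) * img_feat
    # L2-normalise, as retrieval embeddings typically are.
    return fused / np.linalg.norm(fused)
```

Retrieval would then rank candidate images by cosine similarity between this fused query vector and their image embeddings; the abstract's QENet would sit between the fusion step and the similarity computation, refining the fused vector.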