Domain-adaptive In-context Generation Benefits Composed Image Retrieval

18 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: composed image retrieval, text-to-image generation, domain adaptation
Abstract: As a vision-language task, Composed Image Retrieval (CIR) aims to integrate information from a bi-modal query (image + text) to retrieve target images. While supervised CIR has achieved notable success in domain-specific scenarios, its reliance on manually annotated triplets restricts its scalability and applicability. Zero-shot CIR alleviates this by leveraging unlabeled data or automatically collected triplets, yet it often suffers from an intractable domain gap. To this end, we shift the focus to developing robust CIR models under limited labeled data and propose Domain-Adaptive In-context Generation (DAIG), which adapts the in-context capability of a pretrained Text-to-Image (T2I) model to the target domain and the CIR task using few-shot samples, and then transforms LLM-generated textual triplets into unbiased CIR triplets that serve as additional training data. Building on this data, we present a two-stage framework applicable to any supervised CIR approach. The first stage, Distributionally Robust Synthetic Pretraining (DRSP), perturbs visual features to expand the distribution of the synthetic data and improve robustness when training on it. The second stage, Fine-grained Real-world Adaptation (FRA), fine-tunes the model on manually annotated triplets, imposing an angular margin on matching pairs to encourage fine-grained learning. Experiments on two benchmarks validate the effectiveness of our method: under both few-shot and fully supervised CIR settings, DAIG yields substantial performance gains over CLIP4CIR, BLIP4CIR, and SPRC. The code and data will be released as open source.
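The two training-stage mechanisms described above can be illustrated with a minimal PyTorch sketch. This is our own illustration of the abstract's description, not the authors' released code; the function names, the Gaussian noise model for DRSP, and the margin/scale values are assumptions.

```python
import torch
import torch.nn.functional as F


def perturb_features(feat: torch.Tensor, sigma: float = 0.05) -> torch.Tensor:
    """DRSP-style sketch: add Gaussian noise to visual features to broaden
    the distribution of synthetic training data (noise model is assumed)."""
    return feat + sigma * torch.randn_like(feat)


def angular_margin_contrastive(query_feat: torch.Tensor,
                               target_feat: torch.Tensor,
                               margin: float = 0.2,
                               scale: float = 20.0) -> torch.Tensor:
    """FRA-style sketch: contrastive loss with an additive angular margin
    applied only to matching (query, target) pairs.

    query_feat:  (B, D) fused image+text query embeddings
    target_feat: (B, D) target image embeddings; row i matches row i
    """
    q = F.normalize(query_feat, dim=-1)
    t = F.normalize(target_feat, dim=-1)
    cos = q @ t.T                                        # (B, B) cosine similarities
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))   # angles between pairs
    # Tighten the decision boundary by adding a margin to matching pairs only.
    eye = torch.eye(cos.size(0), device=cos.device, dtype=torch.bool)
    logits = torch.where(eye, torch.cos(theta + margin), cos) * scale
    labels = torch.arange(cos.size(0), device=cos.device)
    return F.cross_entropy(logits, labels)
```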
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10949