Robust Cross-Modal Retrieval via Generative Semantic Refinement and Exclusion-Guided Adaptation

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, ICML 2026 regular, CC BY 4.0
Abstract: Vision-Language Pre-trained (VLP) models are vulnerable to real-world query noise. Current cross-modal Test-Time Adaptation (TTA) methods often rely on high-confidence predictions, which induces confirmation bias and neglects the informative signals in ambiguous Low-Confidence Queries. To address this, we propose Generative Semantic Refinement and Exclusion-Guided Adaptation (ReEx), a robust retrieval framework that extends adaptation to the entire query stream. Specifically, textual structural noise is rectified by a Generative Semantic Refinement (GSR) module, which employs Confidence-Guided Dynamic Fusion to anchor LLM-based repairs and prevent semantic drift. To exploit ambiguous data, adaptation is driven by Exclusion-Guided Proxy Contrastive Learning (EPCL), which imposes negative constraints via Exclusion Sets of unlikely candidates. Experimental results on COCO-C and Flickr-C demonstrate that ReEx consistently outperforms existing TTA methods, achieving significant robustness gains with a justifiable computational trade-off.
Lay Summary: Modern AI systems can match images and text, like finding the correct picture for a caption. However, they struggle when queries contain mistakes, like typos or unclear wording. ReEx is a new method that helps these AI systems stay accurate even with imperfect queries. It first “cleans up” the text using a smart language model, fixing errors while keeping the meaning intact. Then, it learns to safely ignore options that are unlikely to match, using the uncertain information to improve decisions. By combining these steps, ReEx makes the system more robust, using all queries—even the confusing ones—to learn better. Tests on standard image-text datasets show it consistently outperforms existing methods. This approach can improve search engines, recommendation systems, and other AI tools that rely on understanding images and language together. It also demonstrates a way to make AI more reliable in real-world noisy environments.
Primary Area: Deep Learning->Robustness
Keywords: Cross-modal retrieval, Test-time adaptation, Domain shift
Submission Number: 1672
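Illustrative sketch: The abstract says EPCL imposes negative constraints through Exclusion Sets of unlikely candidates, but it does not say how those sets are built or what form the loss takes. The toy example below is only one plausible reading and is not the authors' EPCL: it omits proxies and any contrastive term. The bottom-k selection rule, the -log(1 - excluded mass) penalty, and all function names and similarity values are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of similarity scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def exclusion_set(scores, k):
    """Indices of the k lowest-scoring candidates.

    Assumption: 'unlikely candidates' are taken to be the bottom-k by similarity.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[:k]


def exclusion_loss(scores, excluded):
    """Penalty -log(1 - p_excluded), where p_excluded is the softmax mass
    assigned to the excluded candidates. Assumed form, not from the paper."""
    probs = softmax(scores)
    mass = sum(probs[i] for i in excluded)
    # Clamp to avoid log(0) if all mass falls on excluded candidates.
    return -math.log(max(1.0 - mass, 1e-12))


# Hypothetical similarities for one ambiguous query: the top three candidates
# are nearly tied, so a pseudo-label would be unreliable, but the last two
# candidates can still be ruled out with some confidence.
scores = [2.1, 2.0, 1.9, -0.5, -1.2]
excluded = exclusion_set(scores, k=2)
loss = exclusion_loss(scores, excluded)
```

The point of the sketch is that a low-confidence query can still supply a training signal: even when the top candidates are nearly tied, pushing probability mass away from clearly wrong candidates does not require committing to a single positive.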