Abstract: Multi-Query Image Retrieval (MQIR) aims to establish connections between vision and language by exploring fine-grained region-query alignments. It remains a challenging task owing to its intrinsic ambiguity, where a query may match multiple semantically similar regions and thus introduce misleading noise. Although researchers have made great efforts to alleviate ambiguity in many retrieval-related tasks, few attempts consider this bottleneck in MQIR, which greatly limits current performance. To this end, we propose a novel Visual Semantic Contextualization Network (VSCN) to mitigate ambiguity by capturing the contextual knowledge within each image-text pair. Specifically, we first develop a Context Semantic Perception (CSP) module to capture dual-level context, where a visual context transformer explores the intra-context within regions, and a cross-modal context transformer mines the inter-context among concatenated visual-linguistic embeddings. Then, to yield superior contextual understanding, we strengthen the semantics within the context via a Context Semantic Interaction (CSI) module. In particular, knowledge distillation is first employed to transfer CLIP-guided semantics into the regional intra-context to complement potential background information. Then, the intra-context and inter-context interaction is conducted via the self-attention mechanism to link the dual-level context and obtain the interacted contextual knowledge. Our method is evaluated on the Visual Genome dataset and substantially outperforms state-of-the-art methods (a 30.3% improvement in Recall@1 in the first round).
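To make the dual-level context concrete, below is a minimal sketch of the described pipeline: intra-context over region features via a visual transformer layer, inter-context over the concatenated visual-linguistic sequence via a cross-modal transformer layer, and a self-attention interaction linking the two. The class names, dimensions, and fusion-by-concatenation choices are assumptions for illustration; the CLIP knowledge-distillation step and the retrieval head are omitted, and this is not the authors' released implementation.

```python
# Illustrative sketch only; module names, dimensions, and the
# concatenation-based interaction are assumptions, not the paper's code.
import torch
import torch.nn as nn


class ContextSemanticPerception(nn.Module):
    """Hypothetical CSP module: intra-context within regions, then
    inter-context over the concatenated region+query sequence."""

    def __init__(self, dim: int = 512, heads: int = 8) -> None:
        super().__init__()
        self.visual_ctx = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.cross_ctx = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, regions: torch.Tensor, queries: torch.Tensor):
        # regions: (B, n_regions, dim), queries: (B, n_queries, dim)
        intra = self.visual_ctx(regions)            # intra-context within regions
        joint = torch.cat([intra, queries], dim=1)  # concatenated visual-linguistic embeddings
        inter = self.cross_ctx(joint)               # inter-context across modalities
        return intra, inter


class ContextSemanticInteraction(nn.Module):
    """Hypothetical CSI interaction step: link the dual-level context
    with self-attention to obtain interacted contextual knowledge."""

    def __init__(self, dim: int = 512, heads: int = 8) -> None:
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, intra: torch.Tensor, inter: torch.Tensor):
        fused = torch.cat([intra, inter], dim=1)    # dual-level context sequence
        out, _ = self.attn(fused, fused, fused)     # self-attention interaction
        return out


if __name__ == "__main__":
    B, R, Q, D = 2, 36, 4, 512
    csp = ContextSemanticPerception(D)
    csi = ContextSemanticInteraction(D)
    intra, inter = csp(torch.randn(B, R, D), torch.randn(B, Q, D))
    print(csi(intra, inter).shape)  # torch.Size([2, 76, 512]) = (B, R + (R + Q), D)
```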
External IDs: doi:10.1109/tmm.2025.3590927