Keywords: image-text matching; cross-modal retrieval; fine-grained cross-modal alignment
TL;DR: SEPS improves cross-modal alignment via two modules reducing patch redundancy and semantic ambiguity, achieving 23%-86% retrieval improvements on standard benchmarks.
Abstract: Fine-grained cross-modal alignment seeks to establish precise local correspondences between vision and language, serving as a fundamental building block for visual question answering and related multimodal tasks.
However, existing approaches are fundamentally constrained by patch redundancy and ambiguity, which stem from the inherent disparity in information density between modalities: visual inputs provide dense spatial information across numerous patches, while textual descriptions offer sparse, discrete semantic anchors. To address these limitations, we argue that richer semantic guidance is key and propose the Semantic-Enhanced Patch Slimming (SEPS) framework. To our knowledge, this is the first work in fine-grained alignment to combine MLLM-generated dense text with original sparse captions for enhanced visual patch selection. Our framework aggregates both sparse and dense textual representations to identify semantically relevant patches, and employs top-k selection with mean value computation to emphasize critical patch-word correspondences. Extensive experiments on the Flickr30K and MS-COCO datasets demonstrate that SEPS achieves state-of-the-art performance, outperforming existing methods by 23%-86% in rSum across various model backbones, with particularly significant improvements in text-to-image retrieval. Our code is available at https://anonymous.4open.science/r/SEPS/.
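The two ideas named in the abstract, text-guided patch selection and top-k mean scoring of patch-word similarities, can be illustrated with a minimal NumPy sketch. This is not the SEPS implementation: the function names, the max-over-words relevance score, the equal-weight averaging of sparse and dense scores, and the final averaging over words are all assumptions made for illustration only.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def select_patches(patches, sparse_words, dense_words, keep):
    """Return sorted indices of the `keep` patches most relevant to the text.

    patches:      (P, D) patch embeddings
    sparse_words: (Ws, D) word embeddings of the original caption
    dense_words:  (Wd, D) word embeddings of the MLLM-generated text
    """
    p = l2_normalize(patches)
    # Assumed relevance: best cosine similarity of a patch to any word.
    s_sparse = (p @ l2_normalize(sparse_words).T).max(axis=1)
    s_dense = (p @ l2_normalize(dense_words).T).max(axis=1)
    # Assumed aggregation: equal-weight average of both text sources.
    score = 0.5 * (s_sparse + s_dense)
    return np.sort(np.argsort(-score)[:keep])


def topk_mean_similarity(patches, words, k):
    """For each word, average its k highest patch similarities,
    then average over words (hypothetical scoring rule)."""
    sim = l2_normalize(patches) @ l2_normalize(words).T  # shape (P, W)
    k = min(k, sim.shape[0])
    topk = np.sort(sim, axis=0)[-k:, :]  # k largest values per word
    return float(topk.mean(axis=0).mean())
```

In this sketch, an image-text score would be computed by first calling `select_patches` and then passing the retained patches to `topk_mean_similarity`, so that redundant patches do not dilute the similarity.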
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 4953