Abstract: Image-text retrieval (ITR) is a pivotal task in cross-modal research. However, existing methods often suffer from a fundamental yet overlooked challenge: redundancy, which manifests both as semantic redundancy within unimodal representations and as relationship redundancy in cross-modal alignments. This redundancy not only inflates computational costs but also degrades retrieval accuracy by masking salient features and reinforcing spurious correlations. In this work, we are the first to explicitly analyze and address the ITR problem from a redundancy perspective, proposing the iMage-text rEtrieval rEdundancy miTigation (MEET) framework. MEET employs a cascaded, two-stage process to systematically mitigate both forms of redundancy. First, for Semantic Redundancy Mitigation, it repurposes deep hashing and quantization as synergistic tools, producing compact yet highly discriminative representations. Second, for Relationship Redundancy Mitigation, it progressively refines the cross-modal alignment space by filtering misleading negative samples and adaptively reweighting informative pairs. The structural integration of these modules under a unified optimization objective provides a clear and interpretable pathway to retrieval. Extensive experiments on multiple benchmarks demonstrate that MEET consistently surpasses state-of-the-art methods, validating its effectiveness and generalizability.
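The two mitigation stages described above can be loosely illustrated in code. The sketch below is a minimal, hypothetical rendering of the general ideas, not the paper's actual formulation: sign-based binarization stands in for deep hashing, and a simple similarity-margin rule stands in for the filtering and reweighting of negatives; all function names, the margin parameter, and the softmax weighting are illustrative assumptions.

```python
import numpy as np

def binarize(emb):
    """Semantic redundancy mitigation (illustrative): compress float
    embeddings into compact +/-1 hash codes via their sign."""
    return np.where(emb >= 0, 1.0, -1.0)

def filter_and_reweight(anchor, positive, negatives, margin=0.1):
    """Relationship redundancy mitigation (illustrative): drop negatives
    nearly as similar to the anchor as the positive (likely misleading),
    then softmax-weight the survivors by hardness (similarity)."""
    pos_sim = float(anchor @ positive)
    neg_sims = negatives @ anchor
    keep = neg_sims < pos_sim - margin      # filter misleading negatives
    kept = neg_sims[keep]
    if kept.size == 0:
        return keep, np.array([])
    weights = np.exp(kept) / np.exp(kept).sum()  # harder pairs weigh more
    return keep, weights

# Toy usage: one anchor, one positive, three candidate negatives.
anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.0])
negatives = np.array([[0.99, 0.0],   # nearly identical: likely misleading
                      [0.0, 1.0],
                      [-1.0, 0.0]])
keep, weights = filter_and_reweight(anchor, positive, negatives)
```

Here the near-duplicate negative is filtered out, and the remaining two receive weights that favor the harder (more similar) pair.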
DOI: 10.1109/tcsvt.2025.3643601