Abstract: Image-text retrieval (ITR) is a pivotal task in cross-modal research. However, existing methods often suffer from a fundamental yet overlooked challenge: redundancy, which manifests both as semantic redundancy within unimodal representations and as relationship redundancy in cross-modal alignments. This redundancy not only inflates computational costs but also degrades retrieval accuracy by masking salient features and reinforcing spurious correlations. In this work, we are the first to explicitly analyze and address the ITR problem from a redundancy perspective, proposing the iMage-text rEtrieval rEdundancy miTigation (MEET) framework. MEET employs a cascaded, two-stage process to systematically mitigate both forms of redundancy. First, for Semantic Redundancy Mitigation, it repurposes deep hashing and quantization as synergistic tools, producing compact yet highly discriminative representations. Second, for Relationship Redundancy Mitigation, it progressively refines the cross-modal alignment space by filtering misleading negative samples and adaptively reweighting informative pairs. The structural integration of these modules under a unified optimization objective provides a clear and interpretable pathway to retrieval. Extensive experiments on multiple benchmarks demonstrate that MEET consistently surpasses state-of-the-art methods, validating its effectiveness and generalizability.
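The two mitigation stages described above can be loosely illustrated in code. The sketch below is a minimal, hypothetical rendering of the general ideas, not the paper's actual formulation: sign-based binarization stands in for deep hashing, and a simple similarity-margin rule stands in for the filtering and reweighting of negatives; all function names, the margin parameter, and the softmax weighting are illustrative assumptions.

```python
import numpy as np

def binarize(emb):
    """Semantic redundancy mitigation (illustrative): compress float
    embeddings into compact +/-1 hash codes via their sign."""
    return np.where(emb >= 0, 1.0, -1.0)

def filter_and_reweight(anchor, positive, negatives, margin=0.1):
    """Relationship redundancy mitigation (illustrative): drop negatives
    nearly as similar to the anchor as the positive (likely misleading),
    then softmax-weight the survivors by hardness (similarity)."""
    pos_sim = float(anchor @ positive)
    neg_sims = negatives @ anchor
    keep = neg_sims < pos_sim - margin      # filter misleading negatives
    kept = neg_sims[keep]
    if kept.size == 0:
        return keep, np.array([])
    weights = np.exp(kept) / np.exp(kept).sum()  # harder pairs weigh more
    return keep, weights

# Toy usage: one anchor, one positive, three candidate negatives.
anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.0])
negatives = np.array([[0.99, 0.0],   # nearly identical: likely misleading
                      [0.0, 1.0],
                      [-1.0, 0.0]])
keep, weights = filter_and_reweight(anchor, positive, negatives)
```

Here the near-duplicate negative is filtered out, and the remaining two receive weights that favor the harder (more similar) pair.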
DOI: 10.1109/tcsvt.2025.3643601