Abstract: Image–text retrieval (ITR) is challenging because feature representations are inconsistent across modalities, a problem known as the “heterogeneous gap”. Establishing connections between the visual and textual modalities at both fine-grained and coarse-grained scales has proven effective in bridging this gap. However, existing ITR methods do not sufficiently capture intra-complementarity and inter-complementarity. To address this, we propose a novel Dual-level Correlation Learning Network (DCL-net), which strengthens the connections between images and texts by reinforcing the correlation between the visual and textual modalities at both the intra-level and the inter-level. To capture intra-complementarity, intra-level correlation is learned in two steps: first, cross-modal pre-alignment is conducted; second, correlation within each modality is enhanced using absolute position encoding for visual fragments and relative position encoding for text fragments. To capture inter-complementarity, inter-level correlation is learned by integrating visual and textual features at both fine-grained and coarse-grained scales within the F2F and F2C branches. Specifically, a bidirectional correlation mechanism is employed to distinguish relevant samples from irrelevant ones more effectively. Experimental results show that DCL-net outperforms state-of-the-art ITR methods.
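As a rough illustration of what matching at fine-grained and coarse-grained scales in both directions can look like, the following sketch scores one image against one text. It is not the DCL-net architecture and does not reproduce the F2F or F2C branches, which the abstract does not specify. The function names, the max-then-mean aggregation, the mean pooling, and the weight `alpha` are all assumptions chosen for illustration.

```python
import math

# Hypothetical helpers for illustration only; none of these names come from the paper.

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def mean_pool(fragments):
    """Average a list of fragment vectors into one global (coarse-grained) vector."""
    n = len(fragments)
    return [sum(column) / n for column in zip(*fragments)]

def fine_grained_similarity(regions, words):
    """Assumed fine-grained score, computed in both directions.

    Image-to-text: each region keeps its best-matching word.
    Text-to-image: each word keeps its best-matching region.
    The two directional means are then averaged.
    """
    sim = [[cosine(r, w) for w in words] for r in regions]
    image_to_text = sum(max(row) for row in sim) / len(regions)
    text_to_image = sum(max(column) for column in zip(*sim)) / len(words)
    return 0.5 * (image_to_text + text_to_image)

def coarse_grained_similarity(regions, words):
    """Assumed coarse-grained score: cosine between mean-pooled global vectors."""
    return cosine(mean_pool(regions), mean_pool(words))

def image_text_score(regions, words, alpha=0.5):
    """Blend the two scales; alpha is an assumed mixing weight, not a value from the paper."""
    fine = fine_grained_similarity(regions, words)
    coarse = coarse_grained_similarity(regions, words)
    return alpha * fine + (1.0 - alpha) * coarse

# Toy 2-D features: two visual fragments and two candidate captions.
regions = [[1.0, 0.0], [0.0, 1.0]]
matching_words = [[1.0, 0.0], [0.0, 1.0]]
mismatched_words = [[-1.0, 0.0], [0.0, -1.0]]

matched_score = image_text_score(regions, matching_words)
mismatched_score = image_text_score(regions, mismatched_words)
```

In this toy setting the matching caption receives the higher score, which is the ranking behavior a retrieval model needs; a learned model would replace the fixed cosine and pooling steps with trained components.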