Perceive, Reason, and Align: Context-guided cross-modal correlation learning for image-text retrieval
Abstract: Highlights
• Learns context-guided cross-modal correlation for image–text retrieval.
• Generates visual and textual representations by perceiving contextual information.
• Learns intra-modal correlation by reasoning about relations within each modality.
• Learns inter-modal correlation by aligning patches across different modalities.