Abstract: Image-Text Retrieval (ITR) is an important task in which images and text retrieve one another: a query in one modality is matched against semantically relevant content in the other. The key difficulty lies in measuring the semantic relevance between the two interacting modalities. Previous work has only used pre-trained models to extract image and text features for matching. However, this type of approach lacks the pairing and embedding methods required to match multimodal data effectively, and it also loses a significant amount of fine-grained information within each modality. To alleviate these problems, we propose ISES, an Instantiated Semantic Enhanced Scoring model for cross-modal retrieval. Specifically, we design two efficient Semantic Enhancement Score (SES) modules: one (SES-1) is specialized to learn the semantic similarity between images, and the other (SES-2) is specialized to learn the semantic similarity between texts. In addition, we introduce Instance loss into ITR to further optimize the image and text semantic enhancement information, with a view to obtaining better cross-modal retrieval performance. For instance, compared with the existing best method, NAAF, our ISES improves R@1 on the MSCOCO testing set by 2.60% on image-to-text (I2T) retrieval and by 1.21% on text-to-image (T2I) retrieval.
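For readers unfamiliar with the R@1 metric quoted above, the following is a minimal sketch of how Recall@K is commonly computed from a query-by-candidate similarity matrix. It is not the ISES evaluation code: the function name `recall_at_k`, the toy similarity scores, and the ground-truth sets are illustrative assumptions.

```python
from typing import Sequence, Set


def recall_at_k(similarity: Sequence[Sequence[float]],
                ground_truth: Sequence[Set[int]],
                k: int = 1) -> float:
    """Fraction of queries whose top-k candidates contain a ground-truth match.

    similarity[i][j] is the (hypothetical) score between query i and candidate j;
    ground_truth[i] is the set of candidate indices that are correct for query i.
    """
    hits = 0
    for scores, relevant in zip(similarity, ground_truth):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if relevant & set(ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy data (assumed): 3 image queries scored against 3 captions.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]
gt = [{0}, {1}, {2}]

r1 = recall_at_k(sim, gt, k=1)
r2 = recall_at_k(sim, gt, k=2)
```

In this toy case only the first query ranks its correct caption first, while every query has its correct caption within the top two. I2T and T2I scores use the same procedure with the roles of query and candidate swapped.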