Learning Asymmetric Visual Semantic Embedding for Image-Text Retrieval

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: Cross-modal retrieval, image-text matching
TL;DR: In this paper, we propose a novel method to compute visual semantic similarity for image-text matching that outperforms recent state-of-the-art methods on two widely used datasets.
Abstract: Learning visual semantic similarity is the key challenge in bridging the correspondences between images and texts. However, there are many inherent differences between vision and language data, such as information density, i.e., an image can correspond to textual descriptions from multiple different views, which makes it difficult to accurately compute the similarity between the two modalities. Among mainstream methods, global-level methods cannot effectively handle this problem, while local-level methods require complicated matching mechanisms that significantly reduce retrieval efficiency. In this paper, we propose Asymmetric Visual Semantic Embedding (AVSE), a novel model that learns visual semantic similarity by explicitly accounting for the difference in information density between the two modalities while avoiding prohibitive computation. Specifically, to preserve the information density of images, AVSE exploits the large spatial redundancy of image regions to capture multi-view features and concatenate them into the image embedding. It also introduces a novel module that efficiently computes the visual semantic similarity between the asymmetric image and text embeddings by dividing the embeddings into semantic blocks of the same dimension and finding the optimal match between these blocks. Extensive experiments on the large-scale MS-COCO and Flickr30K datasets verify the superiority of AVSE over recent state-of-the-art methods. Compared with the recent NAAF method, AVSE inference is 1000 times faster on the 1K test set and more accurate on the widely used benchmarks.
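
The block-matching similarity described in the abstract can be illustrated with a minimal sketch. This is only one plausible reading of the paper's description, assuming an image embedding of k blocks and a text embedding of m blocks (each of dimension d), and using a greedy best-image-block match per text block in place of whatever exact "optimal match" the authors define; the function name and block counts are hypothetical.

```python
# Hypothetical sketch of block-wise similarity between an asymmetric
# image embedding (k semantic blocks) and a text embedding (m blocks),
# each block of dimension d. The max-over-image-blocks matching below is
# an assumed stand-in for the paper's optimal-match rule.
import torch
import torch.nn.functional as F

def block_similarity(img_emb: torch.Tensor, txt_emb: torch.Tensor, d: int) -> torch.Tensor:
    """img_emb: (k*d,), txt_emb: (m*d,); returns a scalar similarity."""
    img_blocks = F.normalize(img_emb.view(-1, d), dim=-1)   # (k, d)
    txt_blocks = F.normalize(txt_emb.view(-1, d), dim=-1)   # (m, d)
    sims = txt_blocks @ img_blocks.t()                      # (m, k) cosine similarities
    # Match each text block to its best image block, then average.
    return sims.max(dim=1).values.mean()

# Usage: a 4-block image embedding vs. a 2-block text embedding, d = 256.
img = torch.randn(4 * 256)
txt = torch.randn(2 * 256)
print(block_similarity(img, txt, d=256))
```

Because the matching reduces to a single matrix product over pre-computed embeddings, it avoids the per-pair cross-attention used by local-level methods, which is consistent with the claimed efficiency advantage.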
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning