Keywords: visual semantic embedding, image-text matching, uncertainty learning, multi-modal learning
Abstract: Visual Semantic Embedding (VSE), as a link between Computer Vision and Natural Language Processing, aims to jointly learn cross-modal embeddings that bridge the gap between the visual and textual spaces. In recent years, VSE has achieved great success in image-text matching, benefiting from the representation power of deep learning. However, existing methods produce retrieval results by relying only on the ranking of cross-modal similarities, even when those results are unreliable or uncertain. That is, they cannot self-evaluate the quality of their retrieved results, and therefore ignore the uncertainty that is ubiquitous in both data and models. To address this problem, we propose a novel VSE-based method, namely Trust-consistent Visual Semantic Embedding (TcVSE), which supports trustworthy retrieval and self-evaluation for image-text matching. Specifically, TcVSE first models evidence based on cross-modal similarities to capture accurate uncertainty. Second, a simple yet effective consistency module enforces the subjective opinions of the bidirectional VSE models (i2t+t2i) to be consistent, for high reliability and accuracy. Finally, extensive comparison experiments on two widely used benchmark datasets, Flickr30K and MS-COCO, demonstrate the superiority of TcVSE. Furthermore, qualitative experiments provide comprehensive and insightful analyses of the reliability and rationality of our method.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning