Abstract: Learning visual semantic embeddings for image-text matching has achieved considerable success by using a triplet loss to pull together positive image-text pairs, which share similar semantic meaning, and to push apart negative image-text pairs, which differ in semantic meaning. Without constraints from image-image or text-text pairs, however, the resulting visual semantic embedding inevitably suffers from semantic misalignment among similar images or among similar texts. To solve this problem, we present a contrastive visual semantic embedding framework, named ConVSE, which achieves intra-modal semantic alignment through contrastive learning on augmented image-image (or text-text) pairs and achieves inter-modal semantic alignment by applying a hardest-negative-enhanced triplet loss to image-text pairs. To the best of our knowledge, we are the first to find that contrastive learning benefits visual semantic embedding. Extensive experiments on the large-scale MSCOCO and Flickr30K datasets verify the effectiveness of ConVSE, which outperforms existing visual semantic embedding-based methods and achieves new state-of-the-art results.
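The two objectives named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact loss forms, so the code assumes a max-of-hinges triplet loss over in-batch negatives for the inter-modal term and an InfoNCE-style loss over two augmented views for the intra-modal term. The function names, `margin`, `temperature`, and the weighting factor `lam` are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def hardest_negative_triplet_loss(img, txt, margin=0.2):
    """Inter-modal term (assumed form): row i of img matches row i of txt.

    For every image, only the most violating in-batch text contributes,
    and likewise for every text and its most violating image.
    """
    img, txt = l2_normalize(img), l2_normalize(txt)
    sim = img @ txt.T                      # sim[i, j] = cos(image i, text j)
    pos = np.diag(sim)                     # similarity of each matched pair
    is_pos = np.eye(sim.shape[0], dtype=bool)
    # Hinge cost of every negative, with the positive pairs masked out.
    cost_i2t = np.where(is_pos, 0.0, np.maximum(0.0, margin + sim - pos[:, None]))
    cost_t2i = np.where(is_pos, 0.0, np.maximum(0.0, margin + sim - pos[None, :]))
    # Keep the hardest negative text per image and hardest negative image per text.
    return cost_i2t.max(axis=1).sum() + cost_t2i.max(axis=0).sum()


def info_nce_loss(view_a, view_b, temperature=0.1):
    """Intra-modal term (assumed form): row i of view_a and view_b are two
    augmentations of the same sample; other rows in the batch are negatives."""
    a, b = l2_normalize(view_a), l2_normalize(view_b)
    logits = (a @ b.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


def combined_loss(img, txt, img_aug, txt_aug, lam=1.0):
    """Hypothetical combination: inter-modal loss plus weighted intra-modal losses."""
    inter = hardest_negative_triplet_loss(img, txt)
    intra = info_nce_loss(img, img_aug) + info_nce_loss(txt, txt_aug)
    return inter + lam * intra
```

In this sketch the triplet term only sees image-text similarities, so nothing constrains how two similar images sit relative to each other; the InfoNCE terms add that intra-modal constraint, which is the gap the abstract describes.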