VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

Anonymous

03 Nov 2017 (modified: 21 Apr 2024) · ICLR 2018 Conference Blind Submission
Abstract: We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by the use of hard negatives in structured prediction, and by ranking loss functions used in retrieval, we introduce a simple change to common loss functions used to learn multi-modal embeddings. This change, combined with fine-tuning and the use of augmented data, yields significant gains in retrieval performance. We showcase our approach, dubbed VSE++, on the MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (based on R@1).
TL;DR: A new loss based on relatively hard negatives that achieves state-of-the-art performance in image-caption retrieval.
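The loss sketched below illustrates the idea described in the abstract and TL;DR: instead of summing hinge terms over all negatives, only the hardest (highest-scoring) negative per positive pair contributes. The function name, the margin value, and the use of a precomputed image-caption similarity matrix are illustrative assumptions, not code from the paper.

```python
import numpy as np

def max_of_hinges_loss(scores, margin=0.2):
    """Hardest-negative (max-of-hinges) ranking loss.

    scores[i, j] is the similarity between image i and caption j;
    the diagonal holds the matching (positive) pairs. For each positive,
    only the single hardest negative caption and hardest negative image
    incur a hinge penalty, rather than summing over all negatives.
    """
    n = scores.shape[0]
    pos = np.diag(scores)  # similarities of matching pairs s(i, c_i)
    # hinge cost for every (image, caption) pair, relative to the positive
    cost_cap = np.maximum(0.0, margin + scores - pos[:, None])  # negatives vary over captions
    cost_img = np.maximum(0.0, margin + scores - pos[None, :])  # negatives vary over images
    # positives themselves are not negatives: zero out the diagonal
    mask = np.eye(n, dtype=bool)
    cost_cap[mask] = 0.0
    cost_img[mask] = 0.0
    # keep only the hardest negative per row (caption retrieval)
    # and per column (image retrieval)
    return (cost_cap.max(axis=1) + cost_img.max(axis=0)).mean()
```

Replacing the max with a sum over each row and column recovers the conventional sum-of-hinges loss that the paper's modification is contrasted against.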
Keywords: Joint embeddings, Hard Negatives, Visual-semantic embeddings, Cross-modal retrieval, Ranking
Community Implementations: 9 code implementations (https://www.catalyzex.com/paper/arxiv:1707.05612/code)
Withdrawal: Confirmed