Improved Text-Image Matching by Mitigating Visual Semantic Hubs

Anonymous

22 May 2019 (modified: 06 Sept 2019) | OpenReview Anonymous Preprint Blind Submission | Readers: Everyone

Abstract: The hubness problem is widespread in high-dimensional embedding spaces and is a fundamental source of error in cross-modal matching tasks. In this work, we study the emergence of hubs in Visual Semantic Embeddings (VSE) with application to text-image matching. We introduce novel methods that mitigate hubs during both training and inference. For training, we analyze the pros and cons of two widely adopted optimization objectives and propose a novel hubness-aware loss function. The loss is self-adaptive in the sense that it uses local statistics to scale up the weights of "hubs" within a mini-batch. For inference, we propose a heuristic algorithm that imposes hard constraints on the existence of hubs in the predicted graph. It can be combined with a previously proposed cross-modal retrieval criterion, and together they achieve even better performance. We evaluate our methods with various configurations of model architectures and datasets. Both the loss function and the heuristic algorithm exhibit surprisingly good robustness and bring consistent improvements on the task of text-image matching across all settings. Specifically, we report results on the Flickr30k and MS-COCO datasets that exceed the state of the art.
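As a rough illustration of what a "hub" is, the sketch below counts how often each candidate appears among the top-k neighbours of all queries (the standard k-occurrence statistic). It is not the loss function or the inference heuristic proposed in this work; the function name `k_occurrence`, the random embeddings, and the choice of k are assumptions made only for illustration.

```python
import numpy as np

def k_occurrence(sim, k):
    """Count how often each candidate (column) lands in a query's top-k.

    sim[i, j] is the similarity between query i and candidate j.
    Candidates with unusually large counts behave as hubs.
    """
    # Indices of the k most similar candidates for every query.
    topk = np.argsort(-sim, axis=1)[:, :k]
    # Tally appearances per candidate across all queries.
    return np.bincount(topk.ravel(), minlength=sim.shape[1])

# Hypothetical stand-ins for text and image embeddings (random, unit-normalised).
rng = np.random.default_rng(0)
texts = rng.normal(size=(200, 64))
images = rng.normal(size=(200, 64))
texts /= np.linalg.norm(texts, axis=1, keepdims=True)
images /= np.linalg.norm(images, axis=1, keepdims=True)

sim = texts @ images.T           # cosine similarity, texts as queries
counts = k_occurrence(sim, k=10)
print("max k-occurrence:", counts.max(), "mean:", counts.mean())
```

Every query contributes exactly k entries, so the mean count always equals k (when the numbers of queries and candidates match); a maximum far above that mean suggests that a few candidates are being retrieved for many unrelated queries.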

Withdrawal: Confirmed
