Visual Relationship Detection Using Joint Visual-Semantic Embedding

Binglin Li, Yang Wang

Published: 2018, Last Modified: 27 Sept 2024, ICPR 2018, CC BY-SA 4.0

Abstract: Visual relationship detection can serve as an intermediate building block for higher-level tasks such as image captioning, visual question answering, and image-text matching. Because relationships in real-world images follow a long-tailed distribution, zero-shot prediction of relationships that the model has never seen during training can alleviate the burden of collecting examples of every possible relationship. Following zero-shot learning (ZSL) strategies, we propose a joint visual-semantic embedding model for visual relationship detection. In our model, the visual vector and the semantic vector are projected into a shared latent space, where the similarity between the two branches is learned. In the semantic embedding, sequential features over the triplet <sub, pred, obj> are learned to provide context information and are then concatenated with the corresponding component vector of the relationship triplet. Experiments show that the proposed model achieves superior performance in zero-shot visual relationship detection and comparable results in the non-zero-shot scenario.
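The two-branch idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy recurrent pass, the linear projections, the cosine similarity, the random untrained weights, and all names and dimensions (`context_features`, `semantic_vector`, `similarity`, `WORD_DIM`, and so on) are assumptions chosen only to show the data flow. Each component of <sub, pred, obj> is concatenated with a context feature computed over the sequence, and the resulting semantic vector is compared with a visual vector in a shared latent space.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols):
    # Small random weights; a real model would learn these.
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def context_features(word_vecs, w_in, w_rec):
    """Toy recurrent pass over (sub, pred, obj); one hidden state per step."""
    hidden = [0.0] * len(w_rec)
    states = []
    for vec in word_vecs:
        pre = [a + b for a, b in zip(matvec(w_in, vec), matvec(w_rec, hidden))]
        hidden = [math.tanh(x) for x in pre]
        states.append(hidden)
    return states

def semantic_vector(word_vecs, w_in, w_rec):
    """Concatenate each component vector with its sequential context feature."""
    states = context_features(word_vecs, w_in, w_rec)
    out = []
    for vec, state in zip(word_vecs, states):
        out.extend(vec + state)
    return out

def similarity(visual, semantic, w_vis, w_sem):
    """Project both branches into the shared space and take cosine similarity."""
    v = l2_normalize(matvec(w_vis, visual))
    s = l2_normalize(matvec(w_sem, semantic))
    return sum(a * b for a, b in zip(v, s))

# Hypothetical dimensions, chosen only for illustration.
WORD_DIM, HID_DIM, VIS_DIM, LATENT_DIM = 4, 3, 6, 5
w_in = rand_matrix(HID_DIM, WORD_DIM)
w_rec = rand_matrix(HID_DIM, HID_DIM)
w_vis = rand_matrix(LATENT_DIM, VIS_DIM)
w_sem = rand_matrix(LATENT_DIM, 3 * (WORD_DIM + HID_DIM))

# Stand-ins for word embeddings of sub, pred, obj and for a visual feature.
triplet = [[random.random() for _ in range(WORD_DIM)] for _ in range(3)]
visual = [random.random() for _ in range(VIS_DIM)]

sem = semantic_vector(triplet, w_in, w_rec)
score = similarity(visual, sem, w_vis, w_sem)
```

Under these assumptions, a zero-shot query would score a visual vector against the semantic vectors of candidate triplets, including combinations absent from training, and rank the candidates by `score`.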