Abstract: In visual relationship detection (VRD), the diversity of relationships often results in many unseen (i.e., zero-shot) relationships in the test set, and predicting these zero-shot relationships poses a significant challenge. Traditional methods typically incorporate semantic knowledge and spatial features from images, yet they still rely on language priors and dataset annotations. We therefore propose encoding spatial structure information in a knowledge graph and incorporating commonsense spatial relationships to guide prediction. The model comprises two modules: logic tensor networks that encode semantic and spatial knowledge in the negative domain, and a commonsense knowledge graph module, updated with local spatial structure, that serves as positive-domain semantic knowledge. Predictions are further constrained by the region connection calculus (RCC). Experimental results demonstrate competitive performance on the Visual Relationship Detection dataset under the zero-shot setting and on the entire Visual Genome subset. In predicate detection, the model achieves results comparable to benchmarks, while significantly outperforming them in relationship and phrase detection.