Abstract: Scene graph generation aims to recognize visual relationships in images: it detects relationship triplets within a given image and assembles them into a structured representation of the scene. To strengthen the model's understanding of knowledge associations, this paper proposes a Knowledge-Enhanced Context Representation for Unbiased Scene Graph Generation model, which draws on two types of knowledge. First, co-occurrence frequencies of entities and relationships are derived from dataset statistics and serve as commonsense statistical knowledge, approximating human cognition. Second, visual representations extracted from a pre-trained object detection model are incorporated into the framework as visual knowledge. Finally, the local semantic representation of triplets (weighted by the co-occurrence frequencies of entities and relationships), the global semantic representation of the entire image, and the visual features are combined as inputs to generate contextual semantic representations for relational triplets. The model also mitigates the long-tail problem that is prevalent in current scene graph generation. Its effectiveness was validated on two public datasets, Visual Genome (VG) and Graph Question Answering (GQA). Compared with the existing HiKER-SGG model, our approach improves mean recall on the VG dataset by 2.7%, 3.1%, and 3.0% for mR@20, mR@50, and mR@100, respectively. On the GQA dataset, our model nearly doubles the performance of both baselines.
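As a rough illustration of the commonsense statistical knowledge described above, the following minimal sketch estimates how often each relationship co-occurs with a given subject-object pair from annotated triplets. The function name `cooccurrence_frequencies`, the toy triplets, and the choice of a conditional frequency normalized per entity pair are all assumptions made for illustration; the abstract does not specify how the paper computes or normalizes these statistics.

```python
from collections import Counter, defaultdict


def cooccurrence_frequencies(triplets):
    """Estimate the relative frequency of each predicate per (subject, object) pair.

    `triplets` is an iterable of (subject, predicate, object) labels.
    This normalization is an illustrative assumption, not the paper's method.
    """
    pair_counts = Counter()
    triple_counts = Counter()
    for subj, pred, obj in triplets:
        pair_counts[(subj, obj)] += 1
        triple_counts[(subj, pred, obj)] += 1

    freqs = defaultdict(dict)
    for (subj, pred, obj), n in triple_counts.items():
        # Fraction of annotations for this entity pair that use this predicate.
        freqs[(subj, obj)][pred] = n / pair_counts[(subj, obj)]
    return dict(freqs)


# Hypothetical toy annotations, not taken from VG or GQA.
toy_triplets = [
    ("man", "riding", "horse"),
    ("man", "riding", "horse"),
    ("man", "near", "horse"),
    ("cup", "on", "table"),
]
freqs = cooccurrence_frequencies(toy_triplets)
```

In a pipeline of the kind the abstract outlines, frequencies like these could serve as weights on the local semantic representation of each candidate triplet; how the weighting is actually applied is not stated in the abstract.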
External IDs:dblp:conf/apweb/WangLZL24