Transformer-based Scene Graph Generation Network With Relational Attention Module

Published: 01 Jan 2022, Last Modified: 15 May 2025 · ICPR 2022 · CC BY-SA 4.0
Abstract: Scene graph generation is a fundamental task for visual understanding, and the generated scene graphs are leveraged in various downstream tasks. Most existing methods rely only on the features cropped by bounding boxes to predict the relationship between subject and object. As a result, the region outside the subject and object is ignored, and these methods miss contextually important information that lies beyond the subject and object regions. Furthermore, it has recently been pointed out that unannotated instances in the training dataset can lead to false suppression of valid model predictions. To this end, we propose a novel transformer-based network together with a training scheme for our model. We introduce a relational attention module to overcome the cropped-feature problem. This module adaptively extracts contextually important regions from the entire image via an attention mechanism for each entity pair whose relationship is predicted. Moreover, to train our model, we design a training strategy with instance-level pseudo-targets. This design addresses the incomplete-annotation problem by appending instances generated by the trained model as pseudo-targets. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our key designs, and our model achieves state-of-the-art or competitive performance on all tasks (PredCls, SGCls, and SGDet) on the Visual Genome dataset.
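The relational attention idea described above (attending over the whole image for each subject-object pair, rather than only over the cropped box features) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the class name `RelationalAttention`, the fusion-by-concatenation of the pair features, and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RelationalAttention(nn.Module):
    """Hypothetical sketch of a relational attention module: for each
    (subject, object) pair, attend over the full image feature map so
    that context outside the two bounding boxes can be gathered."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Fuse pooled subject and object features into one pair query.
        self.pair_proj = nn.Linear(2 * dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, subj_feat, obj_feat, image_feats):
        # subj_feat, obj_feat: (B, dim) pooled entity features
        # image_feats: (B, HW, dim) flattened backbone feature map
        query = self.pair_proj(torch.cat([subj_feat, obj_feat], dim=-1))
        # Cross-attention: pair query attends over all spatial locations.
        context, _ = self.attn(query.unsqueeze(1), image_feats, image_feats)
        # (B, dim) context vector to combine with pair features for
        # relationship classification.
        return context.squeeze(1)
```

A downstream relationship classifier would then consume this context vector alongside the cropped subject/object features, so predictions are no longer limited to the box regions.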