Abstract: Scene graph generation aims to capture detailed spatial and
semantic relationships between objects in an image, which is
challenging due to incomplete labelling, long-tailed relationship
categories, and relational semantic overlap. Existing Transformer-based
methods either employ distinct queries for objects and predicates or
use holistic queries for relation triplets, and hence often have
limited capacity for learning low-frequency relationships. In this
paper, we present a new Transformer-based method, called DSGG, that
views scene graph detection as a direct graph prediction problem based
on a unique set of graph-aware queries. In particular, each graph-aware
query encodes a compact representation of both the node and all of its
relations in the graph, learned through a relaxed sub-graph matching
during training. Moreover, to address the problem of relational
semantic overlap, we use a relation distillation strategy to
efficiently learn multiple instances of semantic relationships.
Extensive experiments on the Visual Genome (VG) and Panoptic Scene
Graph (PSG) datasets show that our model achieves state-of-the-art
results, improving mR@50 and mR@100 by 3.5% and 6.7%, respectively, on
the scene graph generation task, and by an even more substantial 8.5%
and 10.3%, respectively, on the panoptic scene graph generation task.