SGTR: Generating Scene Graph by Learning Compositional Triplets with Transformer

29 Sept 2021 (modified: 13 Feb 2023) · ICLR 2022 Conference Withdrawn Submission · Readers: Everyone
Keywords: Computer Vision, Scene Graph Generation, Scene Understanding
Abstract: In this work, we propose an end-to-end framework for scene graph generation. Motivated by the recently introduced DETR, our method, termed SGTR, generates scene graphs by learning compositional queries with Transformers. We develop a decoding-and-assembling paradigm for end-to-end scene graph generation. Built on a shared backbone, the overall structure consists of two parallel branches, an entity detector and a triplet constructor, followed by a newly designed assembling mechanism. Specifically, each triplet is constructed from a set of compositional queries in the triplet constructor. The predicate queries and entity queries are learned simultaneously with explicit information exchange. In the training phase, the grouping mechanism is learned by matching the decoded triplets with the outcome of the entity detector. Extensive experimental results show that SGTR achieves state-of-the-art performance, surpassing most existing approaches. Moreover, the sparse queries significantly improve the efficiency of scene graph generation. We hope SGTR can serve as a strong baseline for Transformer-based scene graph generation.
One-sentence Summary: We propose a Transformer-based one-stage scene graph generation method that achieves state-of-the-art or comparable performance on all metrics.
Supplementary Material: zip
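The assembling mechanism in the abstract links each decoded triplet's subject and object queries to outputs of the entity detector. As a rough illustration only (not the authors' code), the sketch below performs this grouping by nearest-neighbor matching in feature space; the function name, shapes, and the L2 distance metric are all assumptions for illustration.

```python
import numpy as np

def assemble_triplets(entity_feats, subj_queries, obj_queries):
    """Match each triplet's subject/object query to its nearest entity.

    entity_feats: (N, D) features from the entity detector branch.
    subj_queries, obj_queries: (M, D) features from the triplet constructor.
    Returns an (M, 2) array of (subject, object) indices into the entity set.
    """
    def nearest(queries):
        # Pairwise squared L2 distances, then argmin over entities.
        d = ((queries[:, None, :] - entity_feats[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    return np.stack([nearest(subj_queries), nearest(obj_queries)], axis=1)

# Toy check: queries that are small perturbations of known entity features
# should be grouped back to those entities.
rng = np.random.default_rng(0)
entities = rng.normal(size=(5, 8))                         # 5 detected entities
subj = entities[[0, 2]] + 0.01 * rng.normal(size=(2, 8))   # 2 triplets
obj = entities[[1, 4]] + 0.01 * rng.normal(size=(2, 8))
pairs = assemble_triplets(entities, subj, obj)
print(pairs.tolist())  # → [[0, 1], [2, 4]]
```

In the paper's training phase this grouping is learned via matching rather than fixed nearest-neighbor lookup; the sketch only conveys the decode-then-assemble data flow.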