Abstract: Scene graph generation (SGG) has been receiving attention as a challenging task in computer vision: the relationships between objects are complex and variable, which makes them very difficult to detect. Most existing SGG methods are either two-stage or point-based single-stage methods, and they usually suffer from excessive time complexity or rely on poor design assumptions. In this paper, we adopt a single-stage generation method inspired by the transformer. The backbone still uses a Convolutional Neural Network (CNN) for image feature extraction; the extracted features are then passed to the transformer for encoding and decoding, and the decoded output is processed to obtain the scene graph. The contributions of this paper are 1) adding a predicate generator to the traditional transformer decoder, and 2) evaluating the method on improved datasets based on Visual Genome, where the results show that it improves relationship recognition in SGG.
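To make the final step concrete, the following is a minimal sketch of how per-query decoder outputs might be turned into a scene graph of (subject, predicate, object) triplets. It is an illustration only, not the implementation used in this paper: the names `Triplet`, `build_scene_graph`, and `threshold`, the assumption that each decoder query yields three probability vectors, and the choice to score a triplet by the product of its three probabilities are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Triplet:
    """One edge of the scene graph: subject -[predicate]-> object."""
    subject: str
    predicate: str
    obj: str
    score: float


def argmax(probs: Sequence[float]) -> Tuple[int, float]:
    """Return the index and value of the largest entry."""
    idx = max(range(len(probs)), key=lambda i: probs[i])
    return idx, probs[idx]


def build_scene_graph(
    queries: Sequence[Tuple[Sequence[float], Sequence[float], Sequence[float]]],
    object_names: Sequence[str],
    predicate_names: Sequence[str],
    threshold: float = 0.3,  # hypothetical confidence cut-off
) -> List[Triplet]:
    """Convert per-query (subject, predicate, object) probabilities to triplets.

    Assumption: each query carries one probability vector per role, and the
    triplet score is the product of the three top probabilities.
    """
    triplets = []
    for subj_probs, pred_probs, obj_probs in queries:
        s_idx, s_p = argmax(subj_probs)
        p_idx, p_p = argmax(pred_probs)
        o_idx, o_p = argmax(obj_probs)
        score = s_p * p_p * o_p
        if score >= threshold:
            triplets.append(
                Triplet(object_names[s_idx], predicate_names[p_idx],
                        object_names[o_idx], score)
            )
    # Highest-confidence relationships first.
    return sorted(triplets, key=lambda t: t.score, reverse=True)
```

Under these assumptions, a query whose top classes are "person", "riding", and "horse" produces the triplet (person, riding, horse), while low-confidence queries are discarded by the threshold.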