Abstract: Most published image captioning methods feed detected object features directly into the model and introduce various attention mechanisms to capture the associations between objects and specific words, but the visual and semantic relationships among objects receive insufficient attention. In this paper, we propose a relational graph reasoning Transformer that explicitly incorporates the visual and semantic relationships between objects to construct an object relation graph within the Transformer. Specifically, beyond the detected object features, the model attends to the global spatial relationships and the semantic context between different objects. A graph-structured feature that correlates object features with their spatial and semantic information is then reasoned over via a learned grafting mechanism. Finally, the resulting contextual graph feature is integrated into the proposed Transformer decoder. Experimental results demonstrate the effectiveness of our relational graph reasoning Transformer.
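The abstract does not specify the exact form of the relation-biased attention or the grafting mechanism, so the following is only a minimal PyTorch sketch of one plausible reading: pairwise spatial and semantic relation embeddings bias the attention scores over object features, and a learned sigmoid gate "grafts" the resulting graph context back onto the original features. All class, argument, and dimension names (`RelationalGraphReasoning`, `d_rel`, the additive bias terms, the gating form) are hypothetical assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RelationalGraphReasoning(nn.Module):
    """Hypothetical sketch: attention over object features, biased by
    pairwise spatial/semantic relation embeddings, fused with the input
    via a learned gate (one possible reading of the paper's "grafting")."""
    def __init__(self, d_model=512, d_rel=64):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Project pairwise relation embeddings to scalar attention biases
        # (assumed form; the paper does not give the exact parameterization).
        self.spatial_bias = nn.Linear(d_rel, 1)
        self.semantic_bias = nn.Linear(d_rel, 1)
        # Learned gate that grafts the graph context onto the object features.
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, obj, spatial_rel, semantic_rel):
        # obj:          (B, N, d_model)   detected object features
        # spatial_rel:  (B, N, N, d_rel)  pairwise spatial relation embeddings
        # semantic_rel: (B, N, N, d_rel)  pairwise semantic relation embeddings
        q, k, v = self.q_proj(obj), self.k_proj(obj), self.v_proj(obj)
        d = q.size(-1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5       # (B, N, N)
        scores = scores + self.spatial_bias(spatial_rel).squeeze(-1)   # spatial edges
        scores = scores + self.semantic_bias(semantic_rel).squeeze(-1) # semantic edges
        attn = scores.softmax(dim=-1)
        graph_ctx = torch.matmul(attn, v)                              # (B, N, d_model)
        g = torch.sigmoid(self.gate(torch.cat([obj, graph_ctx], dim=-1)))
        # Gated "grafting": blend graph-contextual and original features.
        return g * graph_ctx + (1 - g) * obj

# Usage with dummy inputs (36 detected regions, as in common Faster R-CNN setups):
out = RelationalGraphReasoning()(
    torch.randn(2, 36, 512),
    torch.randn(2, 36, 36, 64),
    torch.randn(2, 36, 36, 64),
)
```

Under this reading, the output would serve as the contextual graph feature consumed by the Transformer decoder's cross-attention; the additive relation biases and gated fusion are stand-ins for whatever concrete mechanism the full paper defines.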