Abstract: Image captioning generates a textual description of a given image by analyzing its visual semantics. It is useful in applications such as surveillance, where automatically generated image descriptions enable a more efficient workflow. Accurate descriptions, however, require modeling the interactions among visual objects and semantics, which have not yet been adequately exploited. We therefore propose a novel architecture, the topic-guided local-global graph neural network, which addresses these interactions in a two-level scheme. Local information is characterized through visual objects, and semantic graphs are introduced to formulate their relations. Global information is characterized by a topic graph, which analyzes the captioning context and guides the semantic graphs during captioning. In particular, graph convolutions and graph transformers with a connection between the adjacency matrices are explored. Experimental results on the MS-COCO dataset demonstrate the effectiveness of the proposed method.
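
For readers unfamiliar with the graph convolutions mentioned in the abstract, the following is a minimal sketch of one common formulation, in which node features are propagated over a symmetrically normalized adjacency matrix with self-loops. It is not the architecture proposed in the paper: the toy graph, feature values, weights, and the function names `normalize_adjacency`, `matmul`, and `graph_conv` are all hypothetical and chosen only for illustration.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square, non-negative adjacency matrix."""
    n = len(adj)
    # Add self-loops so that each node retains its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Node degrees, counted after the self-loops are added (always >= 1).
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def graph_conv(adj, features, weights):
    """One graph-convolution layer: ReLU(A_norm @ H @ W)."""
    a_norm = normalize_adjacency(adj)
    hidden = matmul(matmul(a_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]


# Hypothetical toy graph: three nodes (e.g. detected objects),
# with undirected edges 0-1 and 1-2.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
features = [[1.0, 0.0],   # one 2-dimensional feature vector per node
            [0.0, 1.0],
            [1.0, 1.0]]
weights = [[1.0, -1.0],   # assumed 2x2 weight matrix (learned in practice)
           [0.5, 2.0]]

out = graph_conv(adj, features, weights)
```

Each row of `out` is the updated representation of one node, mixing its own features with those of its neighbors. In this sketch the adjacency matrix is fixed; how adjacency matrices are built and connected across graph convolutions and graph transformers is specific to the paper and is not reproduced here.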