Relation-Aware Global-Augmented Transformer for TextCaps

Qiang Li, Bing Li, Can Ma

Published: 2022; Last Modified: 17 May 2023; ICANN (1) 2022; Readers: Everyone

Abstract: The text-based image captioning (TextCaps) task aims to describe a given image reasonably, based on both scene text and visual objects. Although previous works have shown great success, they pay too much attention to the text modality while ignoring other important visual information, and they do not fully exploit the correlations between objects and text. Moreover, traditional transformer-based architectures ignore global information reflecting the entire image, which may lead to missing objects and erroneous reasoning. In this paper, we propose a Relation-aware Global-augmented Transformer (RGT) framework to tackle these problems. Specifically, we use a scene graph extracted from the image to explicitly model the relative semantic and spatial relationships of objects via a graph convolutional network, which not only enhances the visual representations but also encodes explicit semantic features of objects. In addition, we add a multi-modal alignment (MMA) module as a supplement to the multi-modal transformer to strengthen the association between scene text and objects. Finally, a global-augmented transformer (GAT) is designed to obtain a more comprehensive representation of the image, which can alleviate the missing-object and erroneous-reasoning problems. Our method outperforms state-of-the-art models on the TextCaps dataset, improving CIDEr from 105.0 to 107.2.
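The abstract does not say how the global-augmented transformer injects image-level information, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes the global vector is a mean pool of the region features and that it is appended to the keys and values of a single-head dot-product attention with no learned projections. The names `mean_pool` and `global_augmented_attention` are hypothetical.

```python
import math

def mean_pool(features):
    """Average a list of equal-length feature vectors into one global vector."""
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def global_augmented_attention(features):
    """Self-attention in which every region also attends to a global vector.

    Assumption (not from the paper): the global vector is the mean of the
    region features and is appended to the keys/values only.
    """
    global_vec = mean_pool(features)
    keys = features + [global_vec]  # keys double as values in this sketch
    dim = len(features[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in features:
        # Scaled dot product between this region and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the values (here identical to the keys).
        out = [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)]
        outputs.append(out)
    return outputs, global_vec

# Two toy "region" features of dimension 2.
regions = [[1.0, 0.0], [0.0, 1.0]]
attended, global_vec = global_augmented_attention(regions)
```

In this toy setup each output mixes the two region features with their average, so a region's representation always carries some image-level context even when it attends weakly to the other region.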
