Abstract: Current encoder-decoder methods for image captioning mainly rely on a two-stage object detection module, or on large models trained on large-scale datasets to improve effectiveness, which increases computational cost and cannot incorporate new external knowledge. In this paper, we propose the Multi-grained Retrieval Augmentation Transformer (M-RAT), a novel end-to-end method that fuses retrieved text drawn from a changeable datastore with input visual features through a Multi-modal Aligned Encoder, and introduces a specialized attention mechanism, Multi-MSA, that exploits both local and global interactions to capture fine-grained details. Additionally, we enhance the decoder's generation ability by employing fused low-level and high-level embeddings. Experiments demonstrate that M-RAT achieves performance comparable to state-of-the-art baselines with remarkable accuracy and detail, while also showing excellent domain adaptability to novel objects.