Look and Review, Then Tell: Generate More Coherent Paragraphs from Images by Fusing Visual and Textual Information
Abstract: Image paragraph captioning aims to describe given images by generating natural paragraphs. Unfortunately, the paragraphs generated by existing methods typically suffer from poor coherence, since visual information is inevitably lost after the pooling operation, which maps numerous visual features to a single global vector. Furthermore, the pooled vectors make it harder for the language model to interact with details in images, leading to generic or even incorrect descriptions of visual details. In this paper, we propose a simple yet effective module, the Visual Information Enhancement Module (VIEM), to prevent visual information loss during visual feature pooling. Meanwhile, to model inter-sentence dependency, a fusion gate mechanism, which makes the most of the non-pooled features by fusing visual vectors with textual information, is introduced into the language model to further improve paragraph coherence. In experiments, the visual information loss is quantitatively measured with a mutual-information-based method. Surprisingly, the results indicate that this loss in VIEM is only approximately 50% of that incurred by pooling, demonstrating the efficacy of VIEM. Moreover, extensive experiments on the Stanford image-paragraph dataset show that the proposed method achieves promising performance compared with existing methods.
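The abstract does not specify the exact form of the fusion gate. A common instantiation of gated visual-textual fusion uses a sigmoid gate to blend projected visual and textual vectors; the PyTorch sketch below illustrates this pattern. The class name FusionGate, the dimensions, and the convex-blend formulation are assumptions for illustration, not the paper's own implementation.

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Hypothetical sketch of a gated visual-textual fusion step.

    Projects visual features and the language model's textual state into
    a shared space, then blends them with a learned sigmoid gate. This is
    one common design; the paper's actual VIEM/fusion gate may differ.
    """

    def __init__(self, visual_dim: int, text_dim: int, hidden_dim: int):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        v = self.visual_proj(visual)  # project visual features
        t = self.text_proj(text)      # project textual (decoder) state
        # Gate values in (0, 1) decide, per dimension, how much visual
        # versus textual information passes through.
        g = torch.sigmoid(self.gate(torch.cat([v, t], dim=-1)))
        return g * v + (1.0 - g) * t  # convex blend of the two modalities


# Example usage with assumed dimensions (e.g., CNN region features and an
# LSTM hidden state):
gate = FusionGate(visual_dim=2048, text_dim=512, hidden_dim=512)
fused = gate(torch.randn(4, 2048), torch.randn(4, 512))  # -> shape (4, 512)
```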