Abstract: Developing artificial emotional intelligence for machines has become an active research topic in human-computer interaction, especially for educational robots. With the success of generative models, emotion-driven content generation, e.g., image captioning, has emerged as a new research problem, enabling robots to produce content better aligned with human communicative habits. However, existing image captioning techniques have largely overlooked emotional factors, leading to stiff and emotionally flat outputs. This paper proposes an emotion-oriented image captioning method that aims to narrow the gap between model outputs and human perception by introducing text sentiment analysis. Specifically, a masked language model is used to generate candidate textual sequences. A pre-trained CLIP model is then introduced to ensure that the generated descriptions match the visual content of the images. Finally, a text sentiment analysis model is integrated into the framework to enhance emotional expression. Experiments show that, compared to existing techniques, the captions generated by this approach align better with the actual semantic content of the images.
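The abstract does not include an implementation, but the three-stage pipeline it describes (masked-LM candidate generation, CLIP-based image-text matching, sentiment-based reranking) can be sketched concretely. The following is a minimal, illustrative sketch assuming the Hugging Face transformers library; the mask template, model checkpoints, image path `example.jpg`, and mixing weight `alpha` are all assumptions for illustration, not details taken from the paper.

```python
import torch
from PIL import Image
from transformers import pipeline, CLIPModel, CLIPProcessor

# Stage 1: a masked language model proposes candidate caption sequences.
# The template below is a hypothetical example, not the paper's prompt.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
candidates = [r["sequence"] for r in fill_mask("a [MASK] dog playing in the park")]

# Stage 2: a pre-trained CLIP model scores how well each candidate
# matches the visual content of the image.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("example.jpg")  # hypothetical input image
inputs = clip_processor(text=candidates, images=image,
                        return_tensors="pt", padding=True)
with torch.no_grad():
    clip_scores = clip_model(**inputs).logits_per_image.softmax(dim=-1).squeeze(0)

# Stage 3: a text sentiment analysis model rates each candidate's
# emotional expression; here the classifier's confidence is used as a
# rough proxy for emotional strength (an assumption, not the paper's metric).
sentiment = pipeline("sentiment-analysis")
alpha = 0.7  # hypothetical weight balancing visual match vs. sentiment
combined = [alpha * c.item() + (1 - alpha) * sentiment(t)[0]["score"]
            for t, c in zip(candidates, clip_scores)]

# Select the caption with the best combined visual-emotional score.
best_caption = candidates[max(range(len(candidates)), key=combined.__getitem__)]
print(best_caption)
```

The design point this sketch illustrates is that CLIP and the sentiment model act as complementary scorers over the masked-LM candidates: one anchors the caption to the image content, the other pushes it toward emotionally expressive phrasing.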