Abstract: Developing artificial emotional intelligence for machines has become an active research topic in human-computer interaction, especially for educational robots. With the success of generative models, emotion-driven content generation, e.g., image captioning, has emerged as a new research problem, enabling robots to produce content better aligned with human communicative habits. However, existing image captioning techniques have largely overlooked emotional factors, leading to stiff and emotionally flat outputs. This paper proposes an emotion-oriented image captioning method that aims to narrow the gap between model outputs and human perception by introducing text sentiment analysis. Specifically, a masked language model is used to generate candidate textual sequences. A pre-trained CLIP model is then introduced to ensure that the generated descriptions match the visual content of the images. Finally, a text sentiment analysis model is integrated into the framework to enhance emotional expression. Experiments show that, compared to existing techniques, the captions generated by this approach align better with the actual semantic content of the images.
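The abstract does not include an implementation, but the three-stage pipeline it describes (masked-LM candidate generation, CLIP-based image-text matching, sentiment-based reranking) can be sketched concretely. The following is a minimal, illustrative sketch assuming the Hugging Face transformers library; the mask template, model checkpoints, image path `example.jpg`, and mixing weight `alpha` are all assumptions for illustration, not details taken from the paper.

```python
import torch
from PIL import Image
from transformers import pipeline, CLIPModel, CLIPProcessor

# Stage 1: a masked language model proposes candidate caption sequences.
# The template below is a hypothetical example, not the paper's prompt.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
candidates = [r["sequence"] for r in fill_mask("a [MASK] dog playing in the park")]

# Stage 2: a pre-trained CLIP model scores how well each candidate
# matches the visual content of the image.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("example.jpg")  # hypothetical input image
inputs = clip_processor(text=candidates, images=image,
                        return_tensors="pt", padding=True)
with torch.no_grad():
    clip_scores = clip_model(**inputs).logits_per_image.softmax(dim=-1).squeeze(0)

# Stage 3: a text sentiment analysis model rates each candidate's
# emotional expression; here the classifier's confidence is used as a
# rough proxy for emotional strength (an assumption, not the paper's metric).
sentiment = pipeline("sentiment-analysis")
alpha = 0.7  # hypothetical weight balancing visual match vs. sentiment
combined = [alpha * c.item() + (1 - alpha) * sentiment(t)[0]["score"]
            for t, c in zip(candidates, clip_scores)]

# Select the caption with the best combined visual-emotional score.
best_caption = candidates[max(range(len(candidates)), key=combined.__getitem__)]
print(best_caption)
```

The design point this sketch illustrates is that CLIP and the sentiment model act as complementary scorers over the masked-LM candidates: one anchors the caption to the image content, the other pushes it toward emotionally expressive phrasing.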