Keywords: Image Captioning, Emotion-Controllable Generation, Vision–Language Models, Multimodal Learning
Abstract: Despite strong progress in semantic and visual understanding, multimodal image captioning models still rely on prompts or external constraints for emotion control, preventing emotion from acting as a stable internal factor during generation. As a result, emotional expressions are unstable across layers and difficult to reproduce. We propose the Multi-View Emotion Adapter (MVEA), a lightweight, plug-and-play Transformer module that converts emotion from an external stylistic cue into an internal control signal that propagates across layers. MVEA modulates hidden states from two complementary views (magnitude and direction), allowing emotion to participate stably in multi-layer generation. We further introduce a unified training objective that jointly constrains semantics, visual alignment, and emotion. To support stable training and evaluation, we construct an image–text–emotion dataset of approximately 25K samples covering seven emotion categories. Experiments across multiple mainstream multimodal models show consistent improvements in emotion controllability (Emotion Score +11%–25%, Emotion Accuracy +9%–15%), with significantly higher emotional relevance in both human and GPT-based evaluations. Notably, MVEA enables open-source models to substantially narrow the gap with strong closed-source models such as GPT-4o. Overall, MVEA provides a scalable and interpretable framework for emotion-controllable image captioning.
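The abstract does not give MVEA's equations, so the following is only a minimal, hypothetical sketch of how a plug-in adapter might modulate Transformer hidden states from the two views it names, magnitude and direction, conditioned on one of seven emotion categories. All names (`MagnitudeDirectionAdapter`, `emotion_dim`, the sigmoid gate and additive direction shift) are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MagnitudeDirectionAdapter(nn.Module):
    """Hypothetical sketch: modulate hidden states from two views,
    a magnitude gate and a direction shift, both conditioned on an
    emotion embedding, so the emotion signal can be re-applied at
    every layer rather than living only in the prompt."""

    def __init__(self, hidden_dim: int, emotion_dim: int = 64, num_emotions: int = 7):
        super().__init__()
        self.emotion_embed = nn.Embedding(num_emotions, emotion_dim)
        # Magnitude view: per-dimension scaling derived from the emotion code.
        self.scale = nn.Linear(emotion_dim, hidden_dim)
        # Direction view: additive offset that steers the unit direction.
        self.shift = nn.Linear(emotion_dim, hidden_dim)

    def forward(self, h: torch.Tensor, emotion_id: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_dim); emotion_id: (batch,)
        e = self.emotion_embed(emotion_id)                    # (batch, emotion_dim)
        norm = h.norm(dim=-1, keepdim=True)                   # magnitude view
        direction = F.normalize(h, dim=-1)                    # direction view
        gated_norm = norm * torch.sigmoid(self.scale(e)).unsqueeze(1)
        steered_dir = F.normalize(direction + self.shift(e).unsqueeze(1), dim=-1)
        return gated_norm * steered_dir                       # recombine both views

# Usage: inserted after each Transformer block, so the emotion acts as an
# internal control signal that propagates across layers.
adapter = MagnitudeDirectionAdapter(hidden_dim=768)
h = torch.randn(2, 16, 768)
emotion = torch.tensor([3, 5])  # indices into the seven emotion categories
out = adapter(h, emotion)       # (2, 16, 768)
```

Decomposing the hidden state this way lets the two views be controlled independently: the gate rescales activation strength without rotating it, while the shift steers the unit direction without changing its length.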
Paper Type: Long
Research Area: Natural Language Generation
Research Area Keywords: text-to-text generation, data-to-text generation, cross-modal content generation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 678