Keywords: Image Captioning, Emotion-Controllable Generation, Vision–Language Models, Multimodal Learning
Abstract: Despite strong progress in semantic and visual understanding, multimodal image captioning models still rely on prompts or external constraints for emotion control, preventing emotion from acting as a stable internal factor during generation. As a result, emotional expressions are unstable across layers and difficult to reproduce. We propose the Multi-View Emotion Adapter (MVEA), a lightweight, plug-and-play Transformer module that converts emotion from an external stylistic cue into an internal control signal that propagates across layers. MVEA modulates hidden states from two complementary views (magnitude and direction), allowing emotion to participate stably in multi-layer generation. We further introduce a unified training objective that jointly constrains semantics, visual alignment, and emotion. To support stable training and evaluation, we construct an image–text–emotion dataset of approximately 25K samples covering seven emotion categories. Experiments across multiple mainstream multimodal models show consistent improvements in emotion controllability (Emotion Score +11%–25%, Emotion Accuracy +9%–15%), with significantly higher emotional relevance in both human and GPT-based evaluations. Notably, MVEA enables open-source models to substantially narrow the gap with strong closed-source models such as GPT-4o. Overall, MVEA provides a scalable and interpretable framework for emotion-controllable image captioning.
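The abstract does not give MVEA's equations, so the following is only a minimal, hypothetical sketch of how a plug-in adapter might modulate Transformer hidden states from the two views it names, magnitude and direction, conditioned on one of seven emotion categories. All names (`MagnitudeDirectionAdapter`, `emotion_dim`, the sigmoid gate and additive direction shift) are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MagnitudeDirectionAdapter(nn.Module):
    """Hypothetical sketch: modulate hidden states from two views,
    a magnitude gate and a direction shift, both conditioned on an
    emotion embedding, so the emotion signal can be re-applied at
    every layer rather than living only in the prompt."""

    def __init__(self, hidden_dim: int, emotion_dim: int = 64, num_emotions: int = 7):
        super().__init__()
        self.emotion_embed = nn.Embedding(num_emotions, emotion_dim)
        # Magnitude view: per-dimension scaling derived from the emotion code.
        self.scale = nn.Linear(emotion_dim, hidden_dim)
        # Direction view: additive offset that steers the unit direction.
        self.shift = nn.Linear(emotion_dim, hidden_dim)

    def forward(self, h: torch.Tensor, emotion_id: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_dim); emotion_id: (batch,)
        e = self.emotion_embed(emotion_id)                    # (batch, emotion_dim)
        norm = h.norm(dim=-1, keepdim=True)                   # magnitude view
        direction = F.normalize(h, dim=-1)                    # direction view
        gated_norm = norm * torch.sigmoid(self.scale(e)).unsqueeze(1)
        steered_dir = F.normalize(direction + self.shift(e).unsqueeze(1), dim=-1)
        return gated_norm * steered_dir                       # recombine both views

# Usage: inserted after each Transformer block, so the emotion acts as an
# internal control signal that propagates across layers.
adapter = MagnitudeDirectionAdapter(hidden_dim=768)
h = torch.randn(2, 16, 768)
emotion = torch.tensor([3, 5])  # indices into the seven emotion categories
out = adapter(h, emotion)       # (2, 16, 768)
```

Decomposing the hidden state this way lets the two views be controlled independently: the gate rescales activation strength without rotating it, while the shift steers the unit direction without changing its length.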
Paper Type: Long
Research Area: Natural Language Generation
Research Area Keywords: text-to-text generation, data-to-text generation, cross-modal content generation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 678