Keywords: Affective Computing, Multimodal Fusion, Emotional Video Captioning, Hyperbolic Representation, Reinforcement Learning from Human Feedback (RLHF)
Abstract: Generating emotionally aligned language remains a key challenge for large language models. We present \textbf{DualCap}, a multimodal framework that formulates affective understanding as a generation task rather than a discrete classification problem. DualCap performs dual-space reasoning by integrating surface-level multimodal cues with psychologically grounded Valence–Arousal–Dominance (VAD) representations embedded in hyperbolic space. To ensure emotional and linguistic coherence, we introduce a multi-model feedback mechanism in which multiple LLMs collaboratively evaluate and refine captions through aggregated affective and dimensional feedback, analogous to affect-oriented reinforcement learning from human feedback (Affective-RLHF). Experiments on DFEW and MAFW show that DualCap achieves strong recognition performance while substantially improving the expressiveness, interpretability, and emotional fidelity of generated language, demonstrating the value of combining cognitive emotion modeling with feedback-driven generation for emotionally intelligent LLMs.
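The abstract does not specify how VAD coordinates are embedded in hyperbolic space. As a minimal sketch, assuming the common choice of projecting Euclidean VAD vectors into a Poincaré ball via the exponential map at the origin (the function name, curvature, and example VAD values below are illustrative, not taken from the paper), the embedding step could look like:

```python
import torch

def expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Exponential map at the origin of a Poincare ball with curvature -c.

    Maps tangent-space (Euclidean) vectors into the ball; vectors with
    larger norm land closer to the boundary, which lets hierarchical or
    extreme affective states spread out near the edge of the space.
    """
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

# Hypothetical usage: VAD scores in [-1, 1]^3 projected into the ball.
vad = torch.tensor([[0.7, 0.4, -0.2],    # e.g. pleasant, moderately aroused
                    [-0.8, 0.6, -0.5]])  # e.g. fearful, aroused, low dominance
vad_hyp = expmap0(vad, c=1.0)
print(vad_hyp)                 # points inside the unit ball
print(vad_hyp.norm(dim=-1))    # norms strictly below 1
```

Distances between such points would then be measured with the Poincaré (geodesic) distance rather than the Euclidean norm; this is one standard way hyperbolic representations are realized, but the paper's actual construction may differ.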
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability
Languages Studied: Our experiments are conducted on English video datasets (DFEW and MAFW). The proposed DualCap framework is language-agnostic and can be generalized to other languages given corresponding training data.
Submission Number: 4137