Keywords: Affective Computing, Multimodal Fusion, Emotional Video Captioning, Hyperbolic Representation, Reinforcement Learning from Human Feedback (RLHF)
Abstract: Generating emotionally aligned language remains a key challenge for large language models. We present \textbf{DualCap}, a multimodal framework that formulates affective understanding as a generation task rather than a discrete classification problem. DualCap performs dual-space reasoning by integrating surface-level multimodal cues with psychologically grounded Valence–Arousal–Dominance (VAD) representations embedded in hyperbolic space. To ensure emotional and linguistic coherence, we introduce a multi-model feedback mechanism in which multiple LLMs collaboratively evaluate and refine captions through aggregated affective and dimensional feedback, analogous to affect-oriented reinforcement learning from human feedback (Affective-RLHF). Experiments on DFEW and MAFW show that DualCap achieves strong recognition performance while substantially improving the expressiveness, interpretability, and emotional fidelity of generated language, demonstrating the value of combining cognitive emotion modeling with feedback-driven generation for emotionally intelligent LLMs.
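The abstract does not specify how VAD coordinates are embedded in hyperbolic space. As a minimal sketch, assuming the common choice of projecting Euclidean VAD vectors into a Poincaré ball via the exponential map at the origin (the function name, curvature, and example VAD values below are illustrative, not taken from the paper), the embedding step could look like:

```python
import torch

def expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Exponential map at the origin of a Poincare ball with curvature -c.

    Maps tangent-space (Euclidean) vectors into the ball; vectors with
    larger norm land closer to the boundary, which lets hierarchical or
    extreme affective states spread out near the edge of the space.
    """
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

# Hypothetical usage: VAD scores in [-1, 1]^3 projected into the ball.
vad = torch.tensor([[0.7, 0.4, -0.2],    # e.g. pleasant, moderately aroused
                    [-0.8, 0.6, -0.5]])  # e.g. fearful, aroused, low dominance
vad_hyp = expmap0(vad, c=1.0)
print(vad_hyp)                 # points inside the unit ball
print(vad_hyp.norm(dim=-1))    # norms strictly below 1
```

Distances between such points would then be measured with the Poincaré (geodesic) distance rather than the Euclidean norm; this is one standard way hyperbolic representations are realized, but the paper's actual construction may differ.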
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability
Languages Studied: Our experiments are conducted on English video datasets (DFEW and MAFW). The proposed DualCap framework is language-agnostic and can be generalized to other languages given corresponding training data.
Submission Number: 4137