Unlocking Explainable and Effective Multimodal Affective Reasoning via Large Language Models
Abstract: Multimodal affective analysis, which integrates textual, visual, and acoustic signals, has shown great promise in emotion recognition. However, existing neural approaches often lack interpretability, limiting their trustworthiness in real-world applications. To address this, we propose an Explainable Multimodal Affective Reasoning Framework (EMARF), which combines Multimodal Large Language Models (MLLMs) for modality-specific feature extraction, a consistency-guided reasoning mechanism, and lightweight LoRA fine-tuning. EMARF unifies fast classification and Chain-of-Thought (CoT) reasoning in a single framework. Guided by modality-aware prompts, the model learns to adaptively choose between direct prediction and stepwise reasoning, enabling cognitively inspired and explainable decision-making. Experimental results demonstrate that EMARF achieves state-of-the-art performance on multiple benchmarks while maintaining efficiency and transparency.
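The abstract describes EMARF adaptively choosing between direct (fast) prediction and stepwise Chain-of-Thought reasoning. The paper does not give implementation details here, so the following is a minimal illustrative sketch of such confidence-gated routing; all names (`fast_classify`, `cot_reason`, `predict`, the threshold value) and the toy scoring logic are assumptions, not the authors' actual method.

```python
# Hypothetical sketch of confidence-gated routing between a direct
# classification head and explicit step-wise reasoning, in the spirit
# of the adaptive mechanism the abstract describes. All logic here is
# a toy stand-in, not the EMARF implementation.
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str
    confidence: float
    rationale: str  # empty for fast-path predictions


def fast_classify(features: dict) -> Prediction:
    # Stand-in for a direct classification head over fused
    # text/visual/acoustic features (here: a trivial average score).
    score = sum(features.values()) / len(features)
    label = "positive" if score >= 0.5 else "negative"
    confidence = abs(score - 0.5) * 2  # distance from decision boundary
    return Prediction(label, confidence, rationale="")


def cot_reason(features: dict) -> Prediction:
    # Stand-in for step-wise reasoning: inspect each modality in turn,
    # then aggregate, keeping the trace as a human-readable explanation.
    steps = [f"{m}: cue strength {v:.2f}" for m, v in features.items()]
    score = sum(features.values()) / len(features)
    label = "positive" if score >= 0.5 else "negative"
    return Prediction(label, confidence=1.0, rationale="; ".join(steps))


def predict(features: dict, threshold: float = 0.6) -> Prediction:
    # Adaptive routing: use the fast path when it is confident enough,
    # fall back to explicit reasoning (with a rationale) otherwise.
    fast = fast_classify(features)
    return fast if fast.confidence >= threshold else cot_reason(features)
```

An unambiguous input (all modalities strongly positive) takes the fast path and returns no rationale, while a borderline input triggers the reasoning path and carries a per-modality explanation, mirroring the efficiency/transparency trade-off the abstract claims.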