Unlocking Explainable and Effective Multimodal Affective Reasoning via Large Language Models
Abstract: Multimodal affective analysis, which integrates textual, visual, and acoustic signals, has shown great promise in emotion recognition. However, existing neural approaches often lack interpretability, limiting their trustworthiness in real-world applications. To address this, we propose an Explainable Multimodal Affective Reasoning Framework (EMARF), which combines Multimodal Large Language Models (MLLMs) for modality-specific feature extraction, a consistency-guided reasoning mechanism, and lightweight LoRA fine-tuning. EMARF unifies fast classification and Chain-of-Thought (CoT) reasoning in a single framework. Guided by modality-aware prompts, the model learns to adaptively choose between direct prediction and stepwise reasoning, enabling cognitively inspired and explainable decision-making. Experimental results demonstrate that EMARF achieves state-of-the-art performance on multiple benchmarks while maintaining efficiency and transparency.
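The abstract describes EMARF adaptively choosing between direct (fast) prediction and stepwise Chain-of-Thought reasoning. The paper does not give implementation details here, so the following is a minimal illustrative sketch of such confidence-gated routing; all names (`fast_classify`, `cot_reason`, `predict`, the threshold value) and the toy scoring logic are assumptions, not the authors' actual method.

```python
# Hypothetical sketch of confidence-gated routing between a direct
# classification head and explicit step-wise reasoning, in the spirit
# of the adaptive mechanism the abstract describes. All logic here is
# a toy stand-in, not the EMARF implementation.
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str
    confidence: float
    rationale: str  # empty for fast-path predictions


def fast_classify(features: dict) -> Prediction:
    # Stand-in for a direct classification head over fused
    # text/visual/acoustic features (here: a trivial average score).
    score = sum(features.values()) / len(features)
    label = "positive" if score >= 0.5 else "negative"
    confidence = abs(score - 0.5) * 2  # distance from decision boundary
    return Prediction(label, confidence, rationale="")


def cot_reason(features: dict) -> Prediction:
    # Stand-in for step-wise reasoning: inspect each modality in turn,
    # then aggregate, keeping the trace as a human-readable explanation.
    steps = [f"{m}: cue strength {v:.2f}" for m, v in features.items()]
    score = sum(features.values()) / len(features)
    label = "positive" if score >= 0.5 else "negative"
    return Prediction(label, confidence=1.0, rationale="; ".join(steps))


def predict(features: dict, threshold: float = 0.6) -> Prediction:
    # Adaptive routing: use the fast path when it is confident enough,
    # fall back to explicit reasoning (with a rationale) otherwise.
    fast = fast_classify(features)
    return fast if fast.confidence >= threshold else cot_reason(features)
```

An unambiguous input (all modalities strongly positive) takes the fast path and returns no rationale, while a borderline input triggers the reasoning path and carries a per-modality explanation, mirroring the efficiency/transparency trade-off the abstract claims.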