CH-CEMS: A Chinese Multi-Concept Benchmark Dataset Towards Explainable Multi-Modal Sentiment Analysis

17 Sept 2025 (modified: 08 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multimodal Sentiment Analysis, Dataset, Reinforcement Learning, Multimodal Large Language Models
TL;DR: We introduce a Chinese multimodal sentiment dataset with multi-concept annotations for explainable multimodal sentiment analysis, a concept-guided reinforcement-learning method for MLLM, and comprehensive benchmarks.
Abstract: Explainable Multimodal Sentiment Analysis (EMSA) is a rapidly growing research area aimed at advancing robust and faithful multimodal language understanding. Recent explainable datasets and methods based on multimodal large language models (MLLMs) have introduced a new paradigm in affective computing that produces chain-of-thought-style explanations. However, high-quality data resources for EMSA remain scarce, largely because annotating reliable reasoning cues is costly and difficult. To address this gap, we introduce CH-CEMS, the first Chinese multi-concept multimodal sentiment dataset for explainable multimodal sentiment analysis. It contains 3,715 curated video segments annotated with sentiment polarity and intensity. In addition, we annotate three semantic concepts for each sample (speaking style, tone of voice, and facial expression), which serve as explicit reasoning cues and enable process-level supervision. To fully leverage these concept cues, we propose a concept-guided reinforcement learning framework based on Group Relative Policy Optimization (GRPO) for MLLMs, in which concept-level supervision explicitly constrains cross-modal semantic relations and guides the model to infer sentiment from verifiable concepts. We further establish baselines with state-of-the-art multimodal machine learning methods and MLLMs via zero-shot inference and supervised fine-tuning. Experiments show that MLLMs outperform feature-based methods, typically by 4–12\% in accuracy on three-class sentiment analysis, and that our concept-guided GRPO yields a further 8.5\% improvement, surpassing even closed-source models such as GPT-5. We believe CH-CEMS and the accompanying benchmark will facilitate future research on explainable multimodal sentiment analysis. The dataset and code are available at https://anonymous.4open.science/r/CH-CEMS-C34F.
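To make the concept-guided GRPO idea concrete, below is a minimal sketch of how a reward could combine sentiment correctness with the three annotated concepts, and how group-relative advantages are computed over a batch of sampled responses. This is an illustration under assumed conventions, not the paper's implementation; the `Prediction` structure, field names, and weights `w_sentiment` / `w_concept` are hypothetical.

```python
# Hypothetical sketch of a concept-guided reward for GRPO-style training.
# Assumes the MLLM emits a sentiment label plus the three annotated concepts
# (speaking style, tone of voice, facial expression); names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Prediction:
    sentiment: str                     # e.g. "positive" / "neutral" / "negative"
    concepts: dict = field(default_factory=dict)  # e.g. {"speaking_style": "calm", ...}

def concept_guided_reward(pred: Prediction, gold: Prediction,
                          w_sentiment: float = 1.0, w_concept: float = 0.5) -> float:
    """Reward = weighted sentiment correctness + fraction of correctly inferred concepts."""
    r_sent = 1.0 if pred.sentiment == gold.sentiment else 0.0
    matched = sum(pred.concepts.get(k) == v for k, v in gold.concepts.items())
    r_concept = matched / max(len(gold.concepts), 1)
    return w_sentiment * r_sent + w_concept * r_concept

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]
```

In this framing, the concept term gives process-level credit for verifiable intermediate cues, while the group-relative normalization (the "GR" in GRPO) ranks sampled responses against each other rather than against an absolute baseline.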
Primary Area: datasets and benchmarks
Submission Number: 8951