Keywords: Multimodal Dialogue Dataset, Mandarin Emotional Dialogue, Auditory-Visual Emotion Modeling
Abstract: Face-to-face audiovisual interaction is fundamental to human communication, conveying rich and spontaneous emotional expressions. However, existing multimodal dialogue datasets suffer from irregular framing, insufficient coverage of upper-body dynamics, limited emotional diversity, small scale, and a lack of genuine spontaneity. We introduce EmoDialogCN, a large-scale auditory–visual–emotion multimodal dataset specifically designed to capture the richness and spontaneity of real-world face-to-face dialogues. The dataset comprises 21,880 dialogue sessions performed by 119 professional actors across 20 realistic scenarios and 18 emotion categories, totaling 400 hours of recordings, making it the largest and most comprehensive dataset of its kind. A novel data collection framework minimizes equipment interference and ensures authentic multimodal signals. Actors were encouraged to improvise based on their understanding of the context, allowing spontaneous emotions to emerge naturally. EmoDialogCN achieves superior quality metrics, including natural and clear emotional expressions confirmed by subjective evaluations (average inter-rater standard deviation of 0.12), lower emotion distribution deviation (0.64 vs. 5.65), consistent subject framing (52–59% frame occupancy), and comprehensive coverage of facial and upper-body expressions. Models trained on this dataset generate contextually appropriate facial expressions, natural body movements, and realistic speaker–listener dynamics, underscoring the value of authentic spontaneous emotional data. The dataset is publicly available at: https://github.com/EmoDialogCN/EmoDialog
Primary Area: datasets and benchmarks
Submission Number: 18098