Abstract: The emergence of multimodal large language models (MLLMs) advances multimodal emotion recognition (MER) to the next level, moving from basic discriminative tasks to complex emotion understanding that combines advanced video understanding with natural language description. However, the community currently lacks both large-scale datasets with dense, descriptive emotion annotations and a multimodal-centric framework that maximizes the potential of MLLMs for emotion understanding. To address this, we establish a new benchmark for MLLM-based emotion understanding with a novel dataset (MER-Caption) and a new model (AffectGPT). Using our model-based crowd-sourcing data collection strategy, we construct the largest descriptive emotion dataset to date, featuring over 2K fine-grained emotion categories across 115K samples. We also introduce the AffectGPT model, designed with pre-fusion operations that strengthen multimodal integration. Finally, we present MER-UniBench, a unified benchmark with evaluation metrics tailored to both typical MER tasks and the free-form, natural-language output style of MLLMs. Extensive experiments demonstrate AffectGPT's robust performance across various MER tasks. We have released both the code and the dataset to advance research and development in emotion understanding: https://github.com/zeroQiaoba/AffectGPT.
Lay Summary: How can we teach machines to truly understand human emotions in videos? Current AI struggles with nuanced emotions because most datasets are too small and lack detailed annotations. To fix this, we introduce MER-Caption, the largest descriptive emotion dataset to date, with over 2K fine-grained emotion categories and 115K video samples. Using a novel model-driven crowd-sourcing approach, we gathered detailed emotional descriptions to improve machine learning models.
We also developed AffectGPT, a new AI model designed to better integrate video and text information for emotion understanding. Unlike traditional methods that treat emotions as simple labels, AffectGPT generates natural language descriptions of emotions, making its predictions more interpretable and human-like. To evaluate it, we created MER-UniBench, a new benchmark with metrics tailored to both typical MER tasks and the free-form outputs of MLLMs.
Our experiments show that AffectGPT performs strongly across a wide range of emotion recognition tasks. By releasing our dataset and model, we aim to advance research in multimodal emotion understanding and make AI systems more emotionally intelligent.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/zeroQiaoba/AffectGPT
Primary Area: Applications
Keywords: multimodal emotion recognition, AffectGPT, MER-Caption, MER-UniBench
Submission Number: 11009