Keywords: Emotional Intelligence, Long-Context
Abstract: Large language models (LLMs) have made significant progress in Emotional Intelligence (EI) and long-context understanding. However, existing benchmarks tend to overlook certain aspects of EI in long-context scenarios, especially under $\textit{realistic, practical settings}$ where interactions are lengthy, diverse, and often noisy. To move toward such realistic settings, we present $\textit{LongEmotion}$, a benchmark specifically designed for long-context EI tasks. It covers a diverse set of tasks, including $\textbf{Emotion Classification}$, $\textbf{Emotion Detection}$, $\textbf{Emotion QA}$, $\textbf{Emotion Conversation}$, $\textbf{Emotion Summary}$, and $\textbf{Emotion Expression}$. On average, the input length for these tasks reaches 8,777 tokens, with long-form generation required for $\textit{Emotion Expression}$. To enhance performance under realistic constraints, we incorporate Retrieval-Augmented Generation ($\textit{RAG}$) and Collaborative Emotional Modeling ($\textit{CoEM}$), and compare them with standard prompt-based methods. Unlike conventional approaches, our $\textit{RAG}$ method leverages both the conversation context and the large language model itself as retrieval sources, avoiding reliance on external knowledge bases. The $\textit{CoEM}$ method further improves performance by decomposing the task into five stages, integrating both retrieval augmentation and limited knowledge injection. Experimental results show that both $\textit{RAG}$ and $\textit{CoEM}$ consistently enhance EI-related performance across most long-context tasks, advancing LLMs toward more $\textit{practical and real-world EI applications}$. Furthermore, we conduct a detailed case study covering the performance comparison among GPT-series models, the application of CoEM at each stage and its impact on task scores, and the advantages of the LongEmotion dataset in advancing EI.
All of our code and datasets will be open-sourced and can be viewed at the anonymous repository link https://anonymous.4open.science/r/anonymous-578B.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 10412