Keywords: Emotion Recognition in Conversations, rare emotion prediction, zero-shot generation, error analysis, large language models
Abstract: Emotion Recognition in Conversations (ERC) aims to identify speakers’ emotions in multi-turn dialogue. While many recent approaches rely on task-specific fine-tuning, such models may exploit dataset-specific cues. To reduce this effect, we study ERC using Large Language Models (LLMs) in a zero-shot setting, incorporating preceding conversational turns as context. Across standard ERC benchmarks, aggregate evaluation metrics mask substantial differences in per-class behavior despite strong overall performance. We observe that errors occur more frequently for utterances with short replies, interjections, negations, and sentence-type markers such as exclamations and interrogatives. These error patterns raise the question of whether they reflect model behavior or properties of the benchmark datasets themselves.
To further investigate this issue, we conduct a controlled re-annotation study with four additional human annotators, treating the original dataset annotation as a fifth annotator. Strong annotator agreement is observed in only 35\% of cases, and neutral utterances account for the large majority (≥80\%) of these high-agreement instances, indicating that emotion plausibility is a central issue in ERC evaluation. Finally, we analyze model behavior across different agreement levels and introduce an LLM-as-Judge framework that explicitly evaluates emotion plausibility, allowing multiple emotionally coherent interpretations rather than enforcing a single-label decision.
Paper Type: Long
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Research Area Keywords: Emotion Recognition in Conversations, rare emotion prediction, zero-shot generation, error analysis, large language models
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 6522