Keywords: Emotion Recognition in Conversations, rare emotion prediction, zero-shot generation, error analysis, large language models
Abstract: Emotion Recognition in Conversations (ERC) aims to identify speakers’ emotions in multi-turn dialogue. While many recent approaches rely on task-specific fine-tuning, such models may exploit dataset-specific cues. To reduce this effect, we study ERC using Large Language Models (LLMs) in a zero-shot setting, incorporating preceding conversational turns as context. Across standard ERC benchmarks, aggregate evaluation metrics mask substantial differences in per-class behavior despite strong overall performance. We observe that errors occur more frequently for utterances with short replies, interjections, negations, and sentence-type markers such as exclamations and interrogatives. These error patterns raise the question of whether they reflect model behavior or properties of the benchmark datasets themselves.
To further investigate this issue, we conduct a controlled re-annotation study with four additional human annotators, treating the original dataset annotation as a fifth annotator. Strong annotator agreement is observed in only 35\% of cases, and neutral utterances account for the large majority (≥80\%) of these high-agreement instances, indicating that emotion plausibility is a central issue in ERC evaluation. Finally, we analyze model behavior across different agreement levels and introduce an LLM-as-Judge framework that explicitly evaluates emotion plausibility, allowing multiple emotionally coherent interpretations rather than enforcing a single-label decision.
Paper Type: Long
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Research Area Keywords: Emotion Recognition in Conversations, rare emotion prediction, zero-shot generation, error analysis, large language models
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 6522