Causal Emotion Recognition in Conversation: Context Saturation and Discourse-Marker Evidence

Published: 10 Jun 2026, Last Modified: 10 Jun 2026. Accepted by TMLR. License: CC BY 4.0
Abstract: Despite strong recent progress in Emotion Recognition in Conversation (ERC), two gaps remain: we still lack a clear understanding of which modeling choices materially affect performance, and we have limited linguistic analysis that connects recognition findings to interpretable discourse-level patterns. We address both gaps via a systematic study on IEMOCAP, with a cross-dataset validation on MELD that supports the saturation framing while clarifying which effects are corpus-specific. For recognition, we conduct controlled ablations with 10 random seeds and paired tests over seeds, with correction for multiple comparisons, yielding three findings. First, conversational context is the dominant factor: performance saturates quickly, with roughly 90% of the gain observed within our context sweep achieved using only the most recent 10–30 preceding turns, depending on the label set. Second, hierarchical sentence representations are most useful in utterance-only settings, with a clear advantage on MELD, but the benefit vanishes once turn-level context is available, suggesting that conversational history subsumes much of the intra-utterance structure. Third, a simple integration of an external affective lexicon, SenticNet, does not improve results, consistent with pretrained encoders already capturing much of the affective signal needed for ERC. Under a strictly causal, past-only setting, our simple models attain strong performance, 82.69% 4-way and 67.07% 6-way weighted F1, indicating that competitive accuracy is achievable without access to future turns. For linguistic analysis, we examine 5,286 discourse-marker occurrences and find a reliable association between emotion and marker position within the utterance, p < .0001. In particular, Sad utterances show reduced left-periphery marker usage, 21.9%, relative to other emotions, 28–32%, aligning with accounts that link left-periphery markers to active discourse management. 
This pattern is consistent with our recognition results, where Sad benefits most from conversational context, +22 percentage points, suggesting that sadness may be more context-dependent in this corpus than emotions with stronger local pragmatic cues.
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: For the camera-ready version, we added a code availability statement and prepared a public reproducibility repository at https://github.com/philhelenina/causal-erc-context-saturation. We also updated the paper metadata, author information, references, and citation information to match the accepted camera-ready version.
We substantially revised the manuscript in response to the reviewers' comments:
1. We revised the title and overall framing to remove unsupported generation-oriented claims. This addresses Reviewer Xxwu's and Reviewer HXnf's concern that the previous title and abstract over-promised implications for emotion-conditioned generation without a generation experiment. The revised manuscript presents the discourse-marker analysis as descriptive linguistic evidence for interpreting ERC results, with generation left as future work.
2. We reframed the context-length analysis around saturation rather than peak K. This directly addresses Reviewer D7po's concern that Figure 3 over-interpreted large optimal K values and could imply that very long histories are required. Figure 3 now reports saturation K, defined as the earliest context length at which 90% of the observed F1 gain over K=0 is achieved, and the text clarifies that late numerical maxima fall within a post-saturation plateau.
3. We added an explicit in-text reference to Figure 3 in Section 3.7 and revised the surrounding discussion. This addresses Reviewer D7po's request to better integrate Figure 3 into the main narrative.
4. We added explicit discussion of the K=0 baseline and clarified that conversational context complements, rather than replaces, utterance-level lexical and pragmatic cues. This addresses Reviewer D7po's point that per-emotion F1 at K=0 is already non-trivial and that long conversational history should not be treated as strictly necessary.
5. We added a cross-dataset validation on MELD using the same frozen Sentence-RoBERTa, text-only, causal protocol. This addresses the single-dataset concerns raised by Reviewer D7po, Reviewer Xxwu, and Reviewer HXnf. The MELD results support the saturation framing while clarifying that saturation K and the most context-dependent emotion are corpus-specific.
6. We added a Transformer vs. LSTM aggregator comparison to address Reviewer HXnf's concern that context saturation might be an artifact of the LSTM. MELD rows report formal 10-seed like-for-like comparisons, while IEMOCAP Transformer rows are clearly marked as single-seed diagnostic checks. For IEMOCAP, we therefore treat this analysis as robustness evidence rather than as a formal statistical comparison.
7. We expanded the discourse-marker analysis by reporting the small effect size in the main discussion and adding confound-control models for utterance length, speaker identity, and scripted/improvised condition. This addresses Reviewer Xxwu's and Reviewer HXnf's concerns that the discourse-marker effect was small and potentially confounded. The revised interpretation focuses on Sad as the category with the most robust positional effect.
8. We narrowed the SenticNet conclusion to the simple concatenation-based integration scheme tested here. This addresses Reviewer HXnf's concern that the negative SenticNet result should not be generalized to all forms of external knowledge integration.
9. We expanded the discussion of the Happy vs. Excited distinction in the 6-way taxonomy by suggesting possible text-side cues and dimensional alternatives such as valence and arousal. This addresses Reviewer HXnf's suggestion to make the error analysis more practically useful.
10. We added a broader impact and ethical considerations paragraph addressing uncertainty in emotion labels, risks in sensitive applications, cultural and linguistic generalizability, and potential dual-use concerns in real-time emotion monitoring. This addresses Reviewer HXnf's broader impact comments.
11. We corrected broken references, updated table captions and statistics where necessary, and marked reviewer-requested changes in blue throughout the revised manuscript, as requested by the action editor.
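The saturation K defined in item 2 can be computed from a context-length sweep as in the minimal sketch below. It is an illustration, not the authors' implementation: the function name `saturation_k`, the reading of "observed F1 gain" as the best F1 in the sweep minus the F1 at K=0, and the example numbers are all assumptions and are not taken from the paper.

```python
from typing import Mapping


def saturation_k(f1_by_k: Mapping[int, float], fraction: float = 0.9) -> int:
    """Return the earliest K whose F1 gain over K=0 reaches `fraction`
    of the largest gain observed anywhere in the sweep.

    Assumption: "observed gain" means max F1 in the sweep minus F1 at K=0.
    """
    if 0 not in f1_by_k:
        raise ValueError("the sweep must include the K=0 baseline")
    baseline = f1_by_k[0]
    total_gain = max(f1_by_k.values()) - baseline
    if total_gain <= 0:
        return 0  # context never improves on the baseline in this sweep
    threshold = baseline + fraction * total_gain
    for k in sorted(f1_by_k):  # scan context lengths from shortest to longest
        if f1_by_k[k] >= threshold:
            return k
    # Only reachable through floating-point rounding; fall back to the peak K.
    return max(f1_by_k, key=lambda k: f1_by_k[k])


# Hypothetical sweep (illustrative values only, not results from the paper).
sweep = {0: 60.0, 5: 70.0, 10: 78.5, 20: 79.5, 40: 80.0}
print(saturation_k(sweep))
```

In this made-up sweep the numerical peak is at K=40, but 90% of the gain is already reached at K=10, which is the distinction between peak K and saturation K that the revised Figure 3 draws.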
Code: https://github.com/philhelenina/causal-erc-context-saturation
Assigned Action Editor: ~Ali_Etemad1
Submission Number: 6840