Keywords: Multimodal Emotion-Cause Pair Extraction, Temporal Recency Modeling, Cross-Module Temporal Alignment, Speaker Interaction Graph
Abstract: Multimodal emotion-cause pair extraction (MECPE) is a structured link prediction problem that identifies emotion-cause utterance pairs under temporal precedence. While temporal proximity is a strong cue, modular MECPE architectures that combine sequential-aggregation and speaker-interaction components can encode inconsistent recency profiles across those components, destabilizing pair scoring. We propose ATDG (Adaptive Temporal Decay Generator), a low-capacity generator that maps label-free dialogue pace statistics to a dialogue-level time scale, and DP (Dual-Path Temporal Injection), which injects this shared scale into (i) KS (Kernel Smoothing), a kernel-smoothed sequential path that anchors pair scoring, and (ii) SG (Speaker Graph), a temporally decayed speaker-interaction graph path used only for emotion/cause prediction. Sharing a single timescale enforces cross-module temporal coherence without increasing model capacity. To protect the structured pair scorer under multi-task training, we adopt a pair-preserving two-stage schedule: Stage A learns the pair pathway under consistent temporal priors, and Stage B optionally refines the emotion/cause heads with the pair pathway frozen. Experiments on the ECF benchmark show consistent gains in pair extraction (up to 57.92 Pair F1) and robustness to evaluation-time perturbations of the guiding statistics. Code will be released publicly.
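The core mechanism in the abstract, a single dialogue-level timescale derived from label-free pace statistics and shared by a kernel-smoothing path and a temporally decayed speaker graph, can be sketched as follows. This is an illustrative sketch, not the authors' released code; the function names (`pace_timescale`, `kernel_smooth`, `speaker_graph_weights`) and the specific mapping from pace statistics to the timescale are assumptions for exposition.

```python
import numpy as np

def pace_timescale(timestamps, base=2.0):
    """Map label-free pace statistics (inter-utterance time gaps)
    to one dialogue-level decay timescale tau > 0.
    The linear mapping here is a hypothetical illustrative choice."""
    gaps = np.diff(np.asarray(timestamps, dtype=float))
    return base * (1.0 + float(np.mean(gaps))) if gaps.size else base

def kernel_smooth(feats, tau):
    """Sequential path (KS): exponentially decayed kernel smoothing
    over utterance positions, using the shared timescale tau."""
    n = feats.shape[0]
    idx = np.arange(n)
    K = np.exp(-np.abs(idx[:, None] - idx[None, :]) / tau)
    K /= K.sum(axis=1, keepdims=True)  # row-normalize the kernel
    return K @ feats

def speaker_graph_weights(speakers, tau):
    """Graph path (SG): same-speaker edges weighted by the SAME
    temporal decay, so both modules share one recency profile."""
    n = len(speakers)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and speakers[i] == speakers[j]:
                W[i, j] = np.exp(-abs(i - j) / tau)
    return W

# Toy dialogue: 4 utterances, alternating speakers.
tau = pace_timescale([0.0, 1.0, 2.5, 3.0])
smoothed = kernel_smooth(np.eye(4), tau)
W = speaker_graph_weights(["A", "B", "A", "B"], tau)
```

The point of the sketch is cross-module coherence: both `kernel_smooth` and `speaker_graph_weights` consume the same `tau`, so neither path can drift toward a different recency profile during training.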
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: multi-modal dialogue systems
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2078