Abstract: Recognizing dynamic facial expressions (DFER) in real-world videos is challenging due to frequent transitions between non-target and target expressions within a single clip. For instance, a "happiness" video may include neutral or ambiguous expressions before reaching the target emotion. Existing methods discard or down-weight non-target frames; we argue, however, that these frames contain critical transitional cues for robust recognition. To address this, we propose the Momentum-Based Prototype-Centered Learning (PCM) framework, which leverages non-target frames through two key innovations. First, the Prototype-based Pseudo-label Generator (PPG) dynamically assigns pseudo-labels to non-target frames using momentum-updated class prototypes, enabling iterative refinement of their semantic contributions. Second, the Local-Global Temporal Feature Encoder (LGTFE) captures fine-grained variations in specific facial regions (e.g., eye narrowing in suppressed anger) while modeling global expression evolution. Experiments on DFEW, MAFW, and FERV39k show that PCM achieves robust performance. These results highlight that strategically integrating non-target frames through prototype-guided learning and multi-scale temporal modeling significantly enhances real-world DFER accuracy.
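The abstract does not give implementation details, but the general pattern of momentum-updated class prototypes with similarity-based pseudo-labeling can be sketched as follows. This is a minimal illustration, not the paper's actual PPG: the momentum value, the mean-pooled update, and the cosine-similarity assignment are all assumptions for exposition.

```python
import numpy as np

def update_prototype(proto, feats, momentum=0.9):
    """EMA-style momentum update of one class prototype.

    proto: (d,) current prototype; feats: (n, d) frame features
    assigned to this class in the current batch. The mean-pooling
    and momentum=0.9 are illustrative choices, not from the paper.
    """
    return momentum * proto + (1.0 - momentum) * feats.mean(axis=0)

def assign_pseudo_labels(frame_feats, prototypes):
    """Assign each (non-target) frame to its nearest class prototype.

    frame_feats: (n, d); prototypes: (C, d).
    Uses cosine similarity (an assumption) and returns (n,) class indices.
    """
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = f @ p.T          # (n, C) similarity matrix
    return sims.argmax(axis=1)
```

In such a scheme, the prototypes and pseudo-labels refine each other iteratively: better prototypes yield better pseudo-labels for non-target frames, which in turn sharpen the prototypes.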
External IDs: dblp:journals/dsp/LiangXTS26