HI2M: Hard Inter- and Intra-Sample Masking for Dynamic Facial Expression Recognition

18 Sept 2025 (modified: 20 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: dynamic facial expression recognition, self-supervised learning, masked auto-encoder
Abstract: Dynamic facial expression recognition (DFER) holds significant potential for real-world applications. While existing methods have achieved promising results, they face two key limitations: (1) dependence on limited labeled data, and (2) equal treatment of all samples and regions during reconstruction, often resulting in suboptimal attention to informative features. Recent masked autoencoder adaptations address the first limitation but inherit the second. To overcome these challenges, we propose Hard Inter- and Intra-sample Masking (HI2M), a novel framework comprising two synergistic components: Hard Sample Mining (HSM) and Hard Region Discovery (HRD). HI2M operates through a dual-masking strategy. First, HSM employs dynamic sample weighting based on inter-sample reconstruction loss deviations, automatically prioritizing challenging cases. Second, HRD utilizes a trainable reinforcement learning mechanism to identify and mask informative spatiotemporal regions by analyzing intra-sample variances, adaptively selecting varying numbers of regions per frame for fine-grained feature localization. This hierarchical approach, from sample-level importance weighting to region-level adaptive masking, enables focused learning on semantically rich facial dynamics while suppressing noise. Our comprehensive experiments on benchmark datasets demonstrate that HI2M significantly outperforms existing approaches, establishing new state-of-the-art performance in DFER.
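The abstract describes HSM as weighting samples by how far their reconstruction loss deviates from the batch average. A minimal sketch of one plausible form of such weighting is shown below; the function `hard_sample_weights`, the softmax-over-deviations formula, and the temperature `tau` are illustrative assumptions, since the abstract does not specify the exact HSM formulation.

```python
import torch

def hard_sample_weights(recon_losses: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Hypothetical HSM-style weighting: samples whose reconstruction loss
    lies furthest above the batch mean receive larger weights. This is a
    sketch of the idea in the abstract, not the paper's actual formula."""
    deviation = recon_losses - recon_losses.mean()
    # Softmax over deviations: harder (higher-loss) samples dominate.
    return torch.softmax(deviation / tau, dim=0)

# Usage: reweight per-sample reconstruction losses before backpropagation.
losses = torch.tensor([0.2, 0.9, 0.4, 1.5])   # per-sample recon losses
w = hard_sample_weights(losses)
weighted_loss = (w * losses).sum()
```

In this sketch the weights sum to one, so the weighted loss stays on the same scale as a plain mean while shifting emphasis toward hard samples.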
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12366