Abstract: Facial expression recognition plays a crucial role in understanding human emotions and behavior. However, existing models often exhibit biased, imbalanced performance across expression classes. To address this problem, we propose an Adaptive Mask-Guide Supervised Network (AMGSN) that improves the uniformity of facial expression recognition performance. Our adaptive mask guidance mechanism mitigates bias and promotes uniform performance across expression classes: by dynamically generating masks during pre-training, AMGSN learns to distinguish the facial features of under-represented expressions. Specifically, we employ an asymmetric encoder-decoder architecture in which the encoder processes only the unmasked visible regions, while a lightweight decoder reconstructs the original image from the latent representations and mask tokens. By using dynamically generated masks and focusing on informative regions, the model reduces the interference of confounding factors and thus enhances the discriminative power of the learned representations. In the pre-training stage, we introduce an Attention-Based Mask Generator (ABMG) to identify the salient regions of expressions, together with a Mask Ratio Update Strategy (MRUS) that uses the image reconstruction loss to adjust the mask ratio for each image. In the fine-tuning stage, a debiased center loss and a contrastive loss are introduced to optimize the network for overall recognition performance. Extensive experiments on several standard datasets demonstrate that AMGSN significantly improves both balance and accuracy compared with state-of-the-art methods: it reaches 89.34% accuracy on RAF-DB and 62.83% on AffectNet, with standard deviations of only 0.0746 and 0.0484, respectively. These results demonstrate the effectiveness of our improvements.
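To make the Mask Ratio Update Strategy concrete, the following is a minimal, hypothetical PyTorch-style sketch of how a per-image mask ratio might be adjusted from the reconstruction loss. The abstract does not specify the update rule, so the function name `update_mask_ratio`, the linear update, and all hyperparameters (`base_ratio`, `sensitivity`, the clamp bounds) are illustrative assumptions, not the paper's actual formulation.

```python
import torch

def update_mask_ratio(recon_loss: torch.Tensor,
                      running_mean: float,
                      base_ratio: float = 0.75,
                      sensitivity: float = 0.1,
                      lo: float = 0.4,
                      hi: float = 0.9) -> torch.Tensor:
    """Hypothetical per-image mask-ratio update driven by reconstruction loss.

    Images that are harder to reconstruct (loss above the running mean)
    receive a lower mask ratio, leaving more visible patches to learn from;
    easier images receive a higher ratio. AMGSN's actual MRUS rule is not
    given in the abstract; this linear update is an assumption for exposition.
    """
    # Positive delta -> image is harder than average -> unmask more patches.
    delta = recon_loss - running_mean
    ratio = base_ratio - sensitivity * delta
    return ratio.clamp(lo, hi)

# Usage sketch: recon_loss is the per-image MAE-style reconstruction loss.
recon_loss = torch.tensor([0.8, 1.4, 0.5])  # batch of 3 images
running_mean = 0.9                          # e.g., an EMA of past losses
ratios = update_mask_ratio(recon_loss, running_mean)
print(ratios)  # tensor([0.7600, 0.7000, 0.7900])
```

Under these assumptions, hard-to-reconstruct images keep more visible context while easy ones are masked more aggressively, which is one plausible way a reconstruction-loss-driven ratio could steer pre-training toward harder, under-represented expressions.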