FER-Former: Multimodal Transformer for Facial Expression Recognition

Published: 01 Jan 2025 · Last Modified: 18 Oct 2025 · IEEE Trans. Multim. 2025 · CC BY-SA 4.0
Abstract: The ever-increasing demand for intuitive interaction in virtual reality has led to surging interest in facial expression recognition (FER). Several issues are, however, commonly seen in existing methods, including narrow receptive fields and homogeneous supervisory signals. To address these issues, we propose in this paper a novel multimodal supervision-steering transformer for facial expression recognition in the wild, referred to as FER-former. Specifically, to overcome the limitation of narrow receptive fields, a hybrid feature extraction pipeline is designed by cascading prevailing CNNs and transformers. To deal with the issue of homogeneous supervisory signals, a heterogeneous domain-steering supervision module is proposed that incorporates text-space semantic correlations, based on the similarity between image and text features, to enhance image features. Additionally, a FER-specific transformer encoder is introduced to characterize conventional one-hot label-focusing and CLIP-based text-oriented tokens in parallel for final classification. Through the collaboration of these multifarious token heads, global receptive fields with multimodal semantic cues are captured, delivering superb learning capability. Extensive experiments on popular benchmarks demonstrate the superiority of the proposed FER-former over existing state-of-the-art methods.
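Since the abstract describes the architecture only at a high level, the following is a minimal, hypothetical PyTorch sketch of the pipeline it outlines: hybrid CNN-plus-transformer feature extraction, parallel one-hot label-focusing and CLIP-style text-oriented token heads, and image-text similarity steering the supervision. All module names, dimensions, and hyperparameters are illustrative assumptions, not the authors' implementation; the random class_text_feats tensor stands in for embeddings from a frozen CLIP text encoder applied to class prompts.

```python
# Hypothetical sketch of the FER-former pipeline described in the abstract.
# Module names, dimensions, and backbone choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridFeatureExtractor(nn.Module):
    """CNN stem cascaded with a transformer encoder (hybrid pipeline)."""

    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        # Small CNN stem standing in for a "prevailing CNN" backbone.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Two parallel learnable tokens: a one-hot label-focusing token and a
        # CLIP-style text-oriented token, as the abstract describes.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.text_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, images):
        feat = self.stem(images)                      # (B, C, H, W)
        tokens = feat.flatten(2).transpose(1, 2)      # (B, HW, C)
        b = tokens.size(0)
        extra = torch.cat([self.cls_token, self.text_token], dim=1).expand(b, -1, -1)
        out = self.encoder(torch.cat([extra, tokens], dim=1))
        return out[:, 0], out[:, 1]                   # label token, text token


class FERFormer(nn.Module):
    def __init__(self, num_classes=7, dim=256, text_dim=512):
        super().__init__()
        self.backbone = HybridFeatureExtractor(dim)
        self.classifier = nn.Linear(dim, num_classes)
        self.text_proj = nn.Linear(dim, text_dim)

    def forward(self, images, class_text_feats):
        cls_tok, txt_tok = self.backbone(images)
        logits = self.classifier(cls_tok)             # one-hot supervision branch
        # Text-oriented branch: similarity between projected image features and
        # per-class text features steers the supervision.
        img_emb = F.normalize(self.text_proj(txt_tok), dim=-1)
        txt_emb = F.normalize(class_text_feats, dim=-1)
        sim = img_emb @ txt_emb.t()                   # (B, num_classes)
        return logits, sim


# Usage with dummy inputs; real class_text_feats would come from a frozen
# CLIP text encoder run over prompts such as "a photo of a happy face".
model = FERFormer()
images = torch.randn(2, 3, 224, 224)
class_text_feats = torch.randn(7, 512)                # stand-in for CLIP features
logits, sim = model(images, class_text_feats)
labels = torch.tensor([0, 3])
loss = F.cross_entropy(logits, labels) + F.cross_entropy(sim / 0.07, labels)
print(logits.shape, sim.shape, loss.item())
```

The two cross-entropy terms mirror the parallel supervision the abstract describes, one over one-hot labels and one over image-text similarity; the 0.07 temperature follows common CLIP practice and is likewise an assumption here.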