Label-Guided Dynamic Spatial-Temporal Fusion for Video-Based Facial Expression Recognition

Published: 01 Jan 2024, Last Modified: 19 Feb 2025 · IEEE Transactions on Multimedia 2024 · CC BY-SA 4.0
Abstract: Video-based facial expression recognition (FER) in the wild is a common yet challenging task. Extracting spatial and temporal features simultaneously is a common approach, but it does not always yield optimal results because spatial and temporal information are distinct in nature. Extracting spatial and temporal features in a cascaded manner has been proposed as an alternative. However, video-based FER sometimes falls short of image-based FER, indicating that the spatial information of each frame is underutilized and that frame relations are suboptimally modeled in existing spatial-temporal fusion strategies. Although frame labels are highly related to the video label, they have been overlooked by previous video-based FER methods. This paper proposes label-guided dynamic spatial-temporal fusion (LG-DSTF), which adopts frame labels to enhance the discriminative ability of spatial features and to guide temporal fusion. By assigning each frame its video label, two auxiliary classification loss functions are constructed to steer discriminative spatial feature learning at different levels. The cross entropy between a uniform distribution and the label distribution of each frame's spatial features measures that frame's classification confidence. These confidence values serve as dynamic weights that emphasize crucial frames during the temporal fusion of spatial features. LG-DSTF achieves state-of-the-art results on FER benchmarks.
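To make the fusion mechanism concrete, below is a minimal PyTorch sketch of the confidence-weighted temporal fusion described in the abstract: per-frame confidence is the cross entropy between a uniform distribution and the frame's predicted label distribution (peaked, confident predictions score higher), and the normalized confidences weight the frames' spatial features. This is not the authors' implementation; the function names, tensor shapes (T frames, C classes, D-dimensional features), and the softmax normalization of the weights are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def frame_confidence(logits: torch.Tensor) -> torch.Tensor:
    """Cross entropy CE(u, p) between a uniform distribution u and each
    frame's predicted label distribution p. A peaked (confident) p puts
    near-zero mass on most classes, making -log p_c large there, so
    CE(u, p) grows with confidence. logits: (T, C) for one video."""
    probs = F.softmax(logits, dim=-1)                 # per-frame label distribution
    num_classes = logits.shape[-1]
    uniform = torch.full_like(probs, 1.0 / num_classes)
    # CE(u, p) = -sum_c u_c * log p_c; clamp avoids log(0)
    return -(uniform * probs.clamp_min(1e-8).log()).sum(dim=-1)  # (T,)

def dynamic_fusion(frame_feats: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    """Fuse per-frame spatial features into one video-level feature,
    weighting each frame by its normalized classification confidence.
    frame_feats: (T, D); logits: (T, C); returns (D,)."""
    conf = frame_confidence(logits)                   # (T,) dynamic weights
    weights = F.softmax(conf, dim=0)                  # normalize across frames (assumed)
    return (weights.unsqueeze(-1) * frame_feats).sum(dim=0)

# Usage with random stand-in data: 16 frames, 7 expression classes, 512-d features.
feats = torch.randn(16, 512)
logits = torch.randn(16, 7)
video_feat = dynamic_fusion(feats, logits)            # (512,)
```

Under this reading, low-confidence frames (e.g., transitional or occluded faces whose label distribution stays near uniform) contribute little to the fused video representation, while frames with clearly expressed emotions dominate it.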