Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting

Published: 20 Jul 2024, Last Modified: 30 Jul 2024 | MM2024 Poster | CC BY 4.0
Abstract: In Video-based Facial Expression Recognition (V-FER), models are typically trained on closed-set datasets with a fixed number of known classes. However, these models struggle with unknown classes common in real-world scenarios. In this paper, we introduce a challenging Open-set Video-based Facial Expression Recognition (OV-FER) task, aiming to identify both known and new, unseen facial expressions. While existing approaches use large-scale vision-language models like CLIP to identify unseen classes, we argue that these methods may not adequately capture the subtle human expressions needed for OV-FER. To address this limitation, we propose a novel Human Expression-Sensitive Prompting (HESP) mechanism to significantly enhance CLIP's ability to model video-based facial expression details effectively. Our proposed HESP comprises three components: 1) a textual prompting module with learnable prompts to enhance CLIP's textual representation of both known and unknown emotions, 2) a visual prompting module that encodes temporal emotional information from video frames using expression-sensitive attention, equipping CLIP with a new visual modeling ability to extract emotion-rich information, and 3) an open-set multi-task learning scheme that promotes interaction between the textual and visual modules, improving the understanding of novel human emotions in video sequences. Extensive experiments conducted on four OV-FER task settings demonstrate that HESP can significantly boost CLIP's performance (a relative improvement of 17.93% on AUROC and 106.18% on OSCR) and outperform other state-of-the-art open-set video understanding methods by a large margin. Code is available at https://github.com/cosinehuang/HESP.
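The abstract reports results in AUROC, which for open-set recognition measures how well a confidence score separates known-class samples from unknown-class ones. The sketch below is not the HESP method: it assumes a plain CLIP-style baseline in which video and text features are compared by cosine similarity and the maximum softmax probability serves as the "known-ness" score. The function names, the temperature value, and the choice of scoring rule are all illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_with_confidence(video_feats, text_feats, temperature=0.01):
    """Hypothetical CLIP-style scoring (an assumption, not the paper's model).

    video_feats: (N, D) array, one feature vector per video.
    text_feats:  (C, D) array, one feature vector per known-class prompt.
    Returns the predicted known class and the max softmax probability,
    which is used here as the confidence that a sample is a known class.
    """
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    probs = softmax((v @ t.T) / temperature)
    return probs.argmax(axis=1), probs.max(axis=1)

def auroc(known_scores, unknown_scores):
    """AUROC as a pairwise rank statistic.

    Equals the probability that a randomly chosen known-class sample receives
    a higher score than a randomly chosen unknown-class sample; ties count 0.5.
    """
    k = np.asarray(known_scores, dtype=float)[:, None]
    u = np.asarray(unknown_scores, dtype=float)[None, :]
    wins = (k > u).sum() + 0.5 * (k == u).sum()
    return float(wins / (k.size * u.size))
```

In use, one would call `predict_with_confidence` on test videos from both known and unknown classes, then pass the two groups of confidence scores to `auroc`. The pairwise form is quadratic in the number of samples, which is acceptable for a small sketch; a sorted-rank implementation would be preferable for large test sets.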
Primary Subject Area: [Engagement] Emotional and Social Signals
Secondary Subject Area: [Content] Multimodal Fusion, [Content] Vision and Language
Relevance To Conference: This study introduces a new task, Open-set Video-based Facial Expression Recognition (OV-FER), which aims to identify unknown facial expression categories in video data while accurately classifying known ones. It is relevant to multimedia and multimodal processing in two ways. First, it provides a method for recognizing and parsing facial expressions in videos, including novel classes, which improves the analysis of multimedia content and helps computers interpret the emotions depicted in videos. Second, it contributes to multimodal processing: facial expression recognition is a form of visual information processing, and combining it with other modalities such as text yields a more comprehensive and accurate understanding. The study can therefore support multimedia applications such as affective computing, human-computer interaction, content-based retrieval, and affect-aware systems, helping them handle complex real-world scenarios and making human-computer interaction more natural and friendly.
Supplementary Material: zip
Submission Number: 4871