Domain Knowledge Enhanced Vision-Language Pretrained Model for Dynamic Facial Expression Recognition
Abstract: Dynamic facial expression recognition (DFER) is a rapidly developing field that focuses on recognizing facial expressions in video sequences. However, the complex temporal modeling required by noisy frames, together with limited training data, significantly hinders further progress in DFER. Previous efforts have been limited because they tackled these issues separately. Inspired by recent advances in pretrained vision-language models (e.g., CLIP), we propose to leverage such models to jointly address both limitations in DFER. Since the raw CLIP model can neither model temporal relationships nor determine the optimal task-related textual prompts, we exploit DFER-specific domain knowledge, including the characteristics of temporal correlations and the relationships between facial behavior descriptions at different levels, to guide the adaptation of CLIP to DFER. Specifically, we enhance CLIP's visual encoder with a hierarchical video encoder that captures both short- and long-term temporal correlations in DFER. Meanwhile, we align facial expressions with action units through prior knowledge to construct semantically rich textual prompts, which are further enhanced with visual content. Furthermore, we introduce a class-aware consistency regularization mechanism that adaptively filters out noisy frames, bolstering the model's robustness against interference. Extensive experiments on three in-the-wild dynamic facial expression datasets demonstrate that our method outperforms state-of-the-art DFER approaches. The code is available at https://github.com/liliupeng28/DK-CLIP.
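To make the high-level pipeline in the abstract concrete, the following is a minimal PyTorch sketch of a CLIP-style video classifier with short-term (windowed) and long-term (global) temporal mixing over frame features, classified against per-class text-prompt embeddings. It is an illustrative sketch under stated assumptions, not the authors' released DK-CLIP implementation: the module names, window size, feature dimension, and prompt handling are all hypothetical, and the actual CLIP encoders are replaced by random stand-in features.

```python
# Hedged sketch: hierarchical temporal modelling on top of frozen CLIP frame
# features, with cosine-similarity classification against class prompt
# embeddings. All names and hyperparameters here are assumptions for
# illustration only, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalBlock(nn.Module):
    """A lightweight Transformer encoder layer used for temporal mixing."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, x):                      # x: (B, T, D)
        return self.layer(x)


class HierarchicalVideoEncoder(nn.Module):
    """Frame features -> short-term (windowed) -> long-term (global) mixing."""
    def __init__(self, dim: int = 512, window: int = 4):
        super().__init__()
        self.window = window
        self.short_term = TemporalBlock(dim)
        self.long_term = TemporalBlock(dim)

    def forward(self, frame_feats):            # (B, T, D), T divisible by window
        b, t, d = frame_feats.shape
        # Short-term: attend within non-overlapping windows of consecutive frames.
        x = frame_feats.reshape(b * t // self.window, self.window, d)
        x = self.short_term(x).reshape(b, t, d)
        # Long-term: attend across the whole sequence, then pool to one video feature.
        x = self.long_term(x)
        return x.mean(dim=1)                   # (B, D)


def classify(video_feat, text_feats, logit_scale: float = 100.0):
    """Cosine-similarity logits between video and per-class prompt embeddings."""
    v = F.normalize(video_feat, dim=-1)        # (B, D)
    t = F.normalize(text_feats, dim=-1)        # (C, D)
    return logit_scale * v @ t.T               # (B, C)


if __name__ == "__main__":
    # Stand-ins for CLIP image/text encoder outputs (D = 512, as in ViT-B CLIP).
    frame_feats = torch.randn(2, 16, 512)      # 2 clips, 16 frames each
    text_feats = torch.randn(7, 512)           # 7 expression-class prompts
    video_feat = HierarchicalVideoEncoder()(frame_feats)
    print(classify(video_feat, text_feats).shape)   # torch.Size([2, 7])
```

In practice, the text features would come from CLIP's text encoder applied to expression descriptions (e.g., prompts enriched with action-unit phrases as the abstract describes), and the frame features from CLIP's image encoder; the sketch only illustrates how short- and long-term temporal mixing and prompt-based classification compose.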
Primary Subject Area: [Engagement] Emotional and Social Signals
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Human emotional signals are typically multimodal, comprising facial expressions, text, and speech, with facial expressions being the primary channel. Our work develops a novel method that utilizes an enhanced vision-language pretrained model (CLIP) to advance dynamic facial expression recognition. Accordingly, our work not only contributes to the analysis of emotional signals, but also provides a new way of adapting vision-language pretrained models to downstream tasks.
Supplementary Material: zip
Submission Number: 4557