CLIP Facial Expression Recognition: Balancing Precision and Generalization

20 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Facial expression recognition, CLIP, generalization
TL;DR: We learn sigmoid masks over fixed CLIP face features, together with a channel-separation module and a channel-diverse loss, to adapt CLIP for FER and achieve high classification accuracy and high generalization ability at the same time.
Abstract: Current facial expression recognition (FER) methods achieve high classification accuracy but often struggle to generalize to unseen test sets. CLIP, on the other hand, demonstrates impressive generalization ability, albeit at the cost of lower classification accuracy than state-of-the-art (SOTA) FER methods. In this paper, we propose a novel approach to adapt CLIP for FER, striking a balance between precision and generalization. Our motivation is that large pre-trained models like CLIP can extract face features that generalize across diverse FER domains. However, these extracted face features also carry extra information such as age and gender, so they are not directly suitable for FER tasks and yield lower classification accuracy. To solve this problem, we train a traditional FER model to learn sigmoid masks that select only expression-related features from the fixed CLIP face features; the selected features are then used for classification. To improve the generalization ability of the learned masks, we propose a channel-separation module that maps the channels of the masked features directly to logits, avoiding a fully connected (FC) layer and thereby reducing overfitting. We also introduce a channel-diverse loss that encourages the learned masks to be as diverse as possible. Extensive experiments on numerous FER datasets verify that our method outperforms SOTA FER methods by large margins. Given both its high classification accuracy and its generalization ability, the proposed method has the potential to become a new paradigm in the FER field. The code will be available.
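The abstract describes the mechanism only at a high level; below is a minimal PyTorch-style sketch of one plausible reading. All names and shapes (MaskedCLIPFER, channel_diverse_loss, feat_dim, num_classes) are hypothetical, the masks here are plain learnable parameters rather than the output of the traditional FER model mentioned in the abstract, and this should not be read as the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedCLIPFER(nn.Module):
    """Illustrative sketch: per-class sigmoid masks over frozen CLIP face features,
    with a channel-separation step that turns masked channels directly into logits
    (no FC classification layer). Shapes and names are assumptions.
    """
    def __init__(self, feat_dim=512, num_classes=7):
        super().__init__()
        # One learnable mask per expression class; sigmoid keeps values in (0, 1).
        self.mask_logits = nn.Parameter(torch.zeros(num_classes, feat_dim))

    def forward(self, clip_feats):
        # clip_feats: (B, feat_dim), produced by a frozen CLIP image encoder.
        masks = torch.sigmoid(self.mask_logits)          # (C, feat_dim)
        masked = clip_feats.unsqueeze(1) * masks         # (B, C, feat_dim)
        # "Channel separation": pool each class's masked channels into one logit
        # instead of passing the features through a fully connected layer.
        logits = masked.mean(dim=-1)                     # (B, C)
        return logits, masks

def channel_diverse_loss(masks):
    # Encourage different classes to select different channels by penalising
    # pairwise similarity between the normalised masks (off-diagonal terms only).
    m = F.normalize(masks, dim=-1)                       # (C, feat_dim)
    sim = m @ m.t()                                      # (C, C)
    off_diag = sim - torch.diag(torch.diag(sim))
    return off_diag.abs().mean()
```

Under these assumptions, a training step would combine a standard classification loss on the logits with the diversity term, e.g. loss = F.cross_entropy(logits, labels) + lam * channel_diverse_loss(masks), with lam a hypothetical weighting hyperparameter.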
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2421