Abstract: Face anti-spoofing (FAS) based on domain generalization (DG) has garnered increasing attention from researchers. Poor generalization is attributed to the model overfitting to salient liveness-irrelevant signals. Previous methods address this issue either by mapping images from multiple domains into a common feature space or by separating image features into domain-specific and task-related components. However, directly manipulating image features inevitably disrupts their semantic structure. Utilizing the text features of vision-language pre-trained (VLP) models, such as CLIP, to dynamically adjust image features offers the potential for better generalization, exploring a broader feature space while preserving semantic information. Specifically, we propose a FAS method called style-conditional prompt token learning (S-CPTL), which generates generalized text features by training introduced prompt tokens to encode visual styles; the resulting text features then serve as classifier weights, enhancing the model's generalization. Unlike inherently static prompt tokens, our dynamic prompt tokens adaptively capture liveness-irrelevant signals from instance-specific styles, and their diversity is increased through mixed feature statistics to further mitigate overfitting. Thorough experimental analysis demonstrates that S-CPTL outperforms current top-performing methods across four distinct cross-dataset benchmarks.
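The sketch below illustrates, under assumptions, the ideas the abstract describes: instance-specific style statistics (channel-wise mean/std) are mixed across a batch to diversify styles, mapped to dynamic prompt tokens, and the resulting text features act as classifier weights via cosine similarity. All module names, dimensions, and hyperparameters here are hypothetical; this is not the authors' implementation.

```python
# Minimal, hypothetical sketch of style-conditional prompt tokens (assumed names and shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F

def style_stats(feat):
    """Channel-wise mean/std of a feature map (B, C, H, W), treated as the instance style."""
    mu = feat.mean(dim=(2, 3))
    sigma = feat.std(dim=(2, 3)) + 1e-6
    return mu, sigma

def mix_style_stats(mu, sigma, alpha=0.1):
    """MixStyle-like mixing of feature statistics across a shuffled batch,
    increasing the diversity of styles that condition the prompt tokens."""
    lam = torch.distributions.Beta(alpha, alpha).sample((mu.size(0), 1)).to(mu.device)
    perm = torch.randperm(mu.size(0))
    mu_mix = lam * mu + (1 - lam) * mu[perm]
    sigma_mix = lam * sigma + (1 - lam) * sigma[perm]
    return mu_mix, sigma_mix

class StyleConditionalPrompt(nn.Module):
    """Maps mixed style statistics to per-instance prompt tokens, which would be
    prepended to class-name embeddings before a frozen CLIP text encoder."""
    def __init__(self, feat_dim, token_dim, n_tokens=4):
        super().__init__()
        self.proj = nn.Linear(2 * feat_dim, n_tokens * token_dim)
        self.n_tokens, self.token_dim = n_tokens, token_dim

    def forward(self, mu, sigma):
        tokens = self.proj(torch.cat([mu, sigma], dim=-1))
        return tokens.view(-1, self.n_tokens, self.token_dim)

def classify(image_feat, text_feat, temperature=0.01):
    """Text features (e.g., 'live' / 'spoof' prompts) act as classifier weights:
    cosine similarity between image and text features gives the logits."""
    image_feat = F.normalize(image_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    return image_feat @ text_feat.t() / temperature
```

In this reading, only the prompt-generation path is trained while the image features themselves are left intact, which is how text-side conditioning can broaden the explored feature space without disrupting visual semantics.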
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Media Interpretation
Relevance To Conference: This work contributes to multimedia/multimodal processing by leveraging vision-language pre-trained (VLP) models, such as CLIP, to improve face anti-spoofing (FAS) performance in domain generalization (DG) settings. The proposed style-conditional prompt token learning (S-CPTL) method uses the text features of VLP models to dynamically adjust image features, enabling better generalization while preserving semantic information. This multimodal approach explores a broader feature space than traditional methods that directly manipulate image features and thereby disrupt semantic structure. By introducing dynamic prompt tokens that adaptively capture liveness-irrelevant signals from instance-specific styles and increase diversity through mixed feature statistics, S-CPTL effectively mitigates model overfitting. The successful integration of visual and textual features in this work highlights the potential of multimodal processing for enhancing the robustness and generalizability of FAS systems across diverse datasets. This contribution demonstrates the value of leveraging cross-modal information to tackle challenges in multimedia applications, such as combating face spoofing attacks, and opens up new avenues for research in multimodal learning for domain generalization tasks.
Submission Number: 639