Abstract: Domain-Generalized (DG) Face Anti-Spoofing (FAS) has attracted increasing attention from researchers. However, existing methods aim to project features into a shared visual space through adversarial training, making it difficult to explore that space without losing semantic information. We investigate the shortcomings of DG methods that arise when the classifier overfits to significantly different domain distributions. To address this issue, we propose a novel Fine-Grained Prompt Learning (FGPL) method based on Vision-Language Models (VLMs) such as CLIP, which adaptively adjusts classifier weights with text features to mitigate overfitting. Specifically, FGPL first encourages the prompts to learn content and domain semantics by capturing Domain-Agnostic and Domain-Specific features. Furthermore, our prompts are designed to be category-generalized by diversifying the Domain-Specific prompts. Additionally, we design an Adaptive Convolutional Adapter (AC-adapter), implemented as an adaptive combination of Vanilla Convolution and Central Difference Convolution, which is inserted into the image encoder to quickly bridge the gap between general image recognition and the FAS task. Extensive experiments demonstrate that the proposed FGPL is effective and outperforms state-of-the-art methods on several cross-domain datasets.
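The AC-adapter combines a vanilla convolution with a Central Difference Convolution (CDC), where the CDC response is the vanilla response minus a θ-weighted term built from the kernel-weight sum and the center pixel. The sketch below illustrates that combination on a single 2-D channel; the gating function `ac_adapter` and its sigmoid-mixed form are assumptions for illustration, not the paper's exact implementation.

```python
import math

def conv2d(x, w):
    """Valid 2-D cross-correlation of image `x` (list of lists) with kernel `w`."""
    k = len(w)
    h, wd = len(x), len(x[0])
    return [[sum(x[i + a][j + b] * w[a][b] for a in range(k) for b in range(k))
             for j in range(wd - k + 1)] for i in range(h - k + 1)]

def cdc(x, w, theta):
    """Central Difference Convolution: vanilla response minus
    theta * (sum of kernel weights) * center pixel (theta=0 recovers vanilla)."""
    k = len(w)
    wsum = sum(sum(row) for row in w)
    van = conv2d(x, w)
    return [[van[i][j] - theta * wsum * x[i + k // 2][j + k // 2]
             for j in range(len(van[0]))] for i in range(len(van))]

def ac_adapter(x, w, theta, alpha):
    """Hypothetical adaptive combination: a sigmoid gate on a learnable
    scalar `alpha` mixes the vanilla and CDC branches."""
    g = 1.0 / (1.0 + math.exp(-alpha))
    van, cd = conv2d(x, w), cdc(x, w, theta)
    return [[g * van[i][j] + (1.0 - g) * cd[i][j]
             for j in range(len(van[0]))] for i in range(len(van))]
```

In practice the branches would be standard convolution layers (e.g. in PyTorch) sharing one kernel, with `alpha` learned jointly with the backbone so the adapter can emphasize texture gradients (CDC) or raw intensity (vanilla) per task.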
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Fine-Grained Prompt Learning (FGPL) contributes to multimodal and multimedia processing by integrating textual and visual cues through vision-language models (VLMs) such as CLIP. By leveraging Domain-Specific and Domain-Agnostic prompts, the method enhances the interpretability and adaptability of multimedia systems in domain-generalized settings. FGPL addresses a central challenge in multimedia environments, where integrating diverse data types such as text and images is critical for maintaining semantic consistency across domains. Through the Adaptive Convolutional Adapter and fine-grained prompt learning, FGPL preserves and exploits the intrinsic semantic details that are often lost in traditional adversarial training. This improves the robustness of multimedia systems, particularly in security applications such as Face Anti-Spoofing, by adapting the classifier's response to variable multimedia content. The adaptability and precision of FGPL in handling multimodal inputs make it a useful development for multimedia systems operating in varied and unpredictable environments.
Submission Number: 636