Abstract: Flexible modal Face Anti-Spoofing (FAS) aims to aggregate data from all available training modalities to train a single model, and enables testing on samples of any given modality. Previous works introduce shared cross-modal transformers (attention) to learn modality-agnostic features, which inevitably distorts feature structures and yields limited performance. In this work, instead of directly removing modality-specific signals from visual features, we borrow a solution from large-scale vision-language models (VLMs) and propose a novel Flexible Modal CLIP (\textbf{FM-CLIP}) for flexible modal FAS, which uses text features to dynamically adjust visual features to be modality-independent. In the visual branch, the same attack can look very different across modalities, making it difficult for classifiers to identify subtle spoofing cues at test time; we therefore propose a Cross-Modal Spoofing Enhancer (\textbf{CMS-Enhancer}). It comprises a Frequency Extractor (\textbf{FE}) and a Cross-Modal Interactor (\textbf{CMI}), which map attacks from different modalities into a shared frequency space, reducing interference from modality-specific signals and enhancing spoofing cues through cross-modal learning in that space. In the text branch, we introduce Language-Guided Patch Alignment (\textbf{LGPA}) based on prompt learning, which guides the image encoder to focus on patch-level spoofing representations through dynamic weighting by text features. Thus, FM-CLIP can flexibly test samples of different modalities by identifying and enhancing modality-agnostic spoofing cues. Finally, extensive experiments show that FM-CLIP is effective and outperforms state-of-the-art methods on multiple multi-modal datasets.
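To make the two mechanisms concrete, below is a minimal PyTorch-style sketch. It assumes the Frequency Extractor maps features into the shared frequency space via a 2D FFT amplitude spectrum, the Cross-Modal Interactor is cross-attention between two modalities' frequency features, and LGPA reweights patch tokens by their similarity to a prompt-derived text feature. All module names, shapes, and design choices here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class FrequencyExtractor(nn.Module):
    """Hypothetical FE: projects one modality's feature map into a shared
    frequency space via a 2D FFT amplitude spectrum (one plausible reading
    of the abstract; the paper's exact design may differ)."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) visual features from one modality.
        amp = torch.abs(torch.fft.fft2(x, norm="ortho"))  # amplitude spectrum
        return self.proj(amp)


class CrossModalInteractor(nn.Module):
    """Hypothetical CMI: frequency features of one modality attend to
    another's, enhancing spoofing cues shared across modalities."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        # channels must be divisible by heads for MultiheadAttention.
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, fa: torch.Tensor, fb: torch.Tensor) -> torch.Tensor:
        # fa, fb: (B, C, H, W) frequency features of two modalities.
        b, c, h, w = fa.shape
        qa = fa.flatten(2).transpose(1, 2)  # (B, HW, C) queries
        kb = fb.flatten(2).transpose(1, 2)  # (B, HW, C) keys/values
        out, _ = self.attn(qa, kb, kb)
        out = self.norm(qa + out)           # residual enhancement
        return out.transpose(1, 2).reshape(b, c, h, w)


class LanguageGuidedPatchAlignment(nn.Module):
    """Hypothetical LGPA: weights image patch tokens by cosine similarity
    to a prompt-derived text feature, steering the encoder toward
    patch-level spoofing cues."""

    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(10.0))  # learnable temperature

    def forward(self, patches: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, D) patch tokens; text: (B, D) prompt feature.
        p = nn.functional.normalize(patches, dim=-1)
        t = nn.functional.normalize(text, dim=-1).unsqueeze(1)   # (B, 1, D)
        weights = torch.softmax(self.scale * (p * t).sum(-1), dim=-1)  # (B, N)
        return patches * weights.unsqueeze(-1)  # dynamically reweighted patches
```

In a full pipeline, the CMI output would be fused back into the visual branch and the LGPA-reweighted patch tokens would feed the classification head; the softmax over patches above is one plausible form of the "dynamic weighting by text features" the abstract describes, not a confirmed detail.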
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: In this paper, we build a Flexible Modal CLIP (FM-CLIP) framework on top of a large vision-language model to handle flexible modal face anti-spoofing (FAS). FM-CLIP uses signals generated by the language model to guide the visual encoder to focus on spoofing cues in the image. In addition, we add a cross-modal spoofing-cue enhancer in the visual branch to learn complementary cross-modal features. Together, these components address the limited performance caused by the feature distortion that traditional methods incur by relying on pure visual encoders, and provide a new solution for flexible modal FAS.
Submission Number: 638