Abstract: Face Anti-Spoofing (FAS) is essential for securing face recognition systems against presentation attacks. Recent advances in sensor technology and multimodal learning have enabled the development of multimodal FAS systems. However, existing methods often struggle to generalize to unseen attacks and diverse environments due to two key challenges: (1) Modality unreliability, where sensors such as depth and infrared suffer from severe domain shifts, impairing the reliability of cross-modal fusion; and (2) Modality imbalance, where over-reliance on a dominant modality weakens the model’s robustness against attacks that affect other modalities. To overcome these issues, we propose MMDG++, a multimodal domain-generalized FAS framework built upon the vision-language model CLIP. In MMDG++, we design the Uncertainty-Guided Cross-Adapter++ (U-Adapter++) to filter out unreliable regions within each modality, enabling more reliable multimodal interactions. Additionally, we introduce Rebalanced Modality Gradient Modulation (ReGrad), which adaptively modulates gradients to balance the convergence of the individual modalities. To further enhance generalization, we propose Asymmetric Domain Prompts (ADPs) that leverage CLIP’s language priors to learn generalized decision boundaries across modalities. We also develop a novel multimodal FAS benchmark to evaluate generalizability under various deployment conditions. Extensive experiments on this benchmark show that our method outperforms state-of-the-art FAS methods, demonstrating superior generalization capability.
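The general idea behind gradient modulation for modality rebalancing can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch only: the convergence proxy, scaling rule, and function names below are assumptions for exposition and do not reproduce the paper's exact ReGrad formulation.

```python
import torch

def rebalance_modality_grads(modality_params, modality_losses, prev_losses, eps=1e-8):
    """Illustrative sketch: damp gradients of fast-converging (dominant) modality
    branches so that under-optimized modalities keep relatively larger updates.

    modality_params: dict mapping modality name -> iterable of its branch parameters
    modality_losses: dict mapping modality name -> current loss value (float)
    prev_losses:     dict mapping modality name -> loss value at the previous step (float)
    """
    # Convergence-speed proxy (assumed here): how much each modality's loss dropped.
    speed = {m: max(prev_losses[m] - modality_losses[m], 0.0) + eps
             for m in modality_losses}
    mean_speed = sum(speed.values()) / len(speed)

    for m, params in modality_params.items():
        # Modalities converging faster than average get their gradients scaled down;
        # slower ones are left at (close to) full strength.
        coef = min(1.0, mean_speed / speed[m])
        for p in params:
            if p.grad is not None:
                p.grad.mul_(coef)

# Typical usage (hypothetical training loop):
#   total_loss.backward()
#   rebalance_modality_grads(branch_params, current_losses, previous_losses)
#   optimizer.step()
```

The sketch only shows where such a modulation step would sit in a training loop (after backpropagation, before the optimizer update); the actual criterion used by ReGrad to measure modality dominance is defined in the paper itself.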