Keywords: Unified Face Attack Detection; Cross-Modal Prompt Tuning; Synonym Semantic Augmentation; Fourier High-Frequency Amplifier.
TL;DR: We propose Cross-Modal Prompt Tuning (CMPT), a bidirectional framework that realigns vision–language models for Unified Face Attack Detection by enriching category semantics and consolidating forgery cues.
Abstract: Pre-trained vision–language models (VLMs), such as CLIP, fail to realize their anticipated superiority on the Unified Face Attack Detection (UAD) task. We attribute this to two task-specific challenges: (1) Categorical ambiguity. UAD categories such as live and fake pose challenges for semantic alignment in CLIP, as they are subjectively defined concepts rather than literal meanings. (2) Forgery diversity. The diversity of forgery cues across physical and digital attacks hinders the language modality from delineating reliable decision boundaries. To address these issues, we propose Cross-Modal Prompt Tuning (CMPT), a bidirectional prompt-transfer framework that realigns vision and language. In the language branch, Synonym Semantic Augmentation (SSA) retrieves semantically related neighbors from a frozen vocabulary and integrates them via similarity-weighted aggregation, enriching category semantics and aiming for comprehensive coverage of category expressions. In the vision branch, a Fourier-based High-Frequency Amplifier (FHFA) suppresses low frequencies and adaptively strengthens the real and imaginary components of high-frequency signals with learnable convolutions, consolidating diverse forgery cues into a shared discriminative space. Within UAD-CMPT, the resulting semantically augmented categories are sent to the vision branch, and instance-conditioned visual prompts encoding decision criteria are returned to the language branch; both act as learnable prompts to achieve vision–language alignment. Extensive experiments demonstrate that UAD-CMPT consistently outperforms state-of-the-art (SOTA) methods on multiple UAD benchmarks.
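The similarity-weighted aggregation at the heart of SSA can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the top-k retrieval, the softmax weighting, and the fixed blending factor `alpha` are all assumptions standing in for the learned components described in the abstract.

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def synonym_semantic_augmentation(category_emb, vocab_embs, k=2, alpha=0.5):
    """Retrieve the top-k nearest vocabulary embeddings by cosine
    similarity and fuse them with the category embedding via
    similarity-weighted (softmax) aggregation. `alpha` blends the
    original embedding with the neighbor aggregate (a placeholder
    for whatever weighting the method actually learns)."""
    sims = sorted(((cosine(category_emb, v), v) for v in vocab_embs),
                  key=lambda t: t[0], reverse=True)[:k]
    # softmax over similarities -> aggregation weights
    exps = [math.exp(s) for s, _ in sims]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(category_emb)
    neighbor_agg = [sum(w * v[d] for w, (_, v) in zip(weights, sims))
                    for d in range(dim)]
    return [alpha * c + (1 - alpha) * n
            for c, n in zip(category_emb, neighbor_agg)]
```

In practice the vocabulary would be the frozen text-encoder embedding table, so the augmented category vector stays in the same space as CLIP's text features.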
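The FHFA idea (suppress low frequencies, rescale the real and imaginary parts of the high frequencies, transform back) can be illustrated on a 1-D signal. This is a toy sketch under stated assumptions: the paper operates on 2-D image features with learnable convolutions per component, whereas here fixed scalar gains (`real_gain`, `imag_gain`) and a hard `cutoff` stand in for the learned parts, and a naive DFT replaces an FFT for self-containedness.

```python
import cmath

def fhfa_1d(signal, cutoff=2, real_gain=1.5, imag_gain=1.5):
    """Fourier high-frequency amplifier on a 1-D signal:
    zero out frequency bins below `cutoff`, scale the real and
    imaginary components of the remaining high-frequency bins,
    then apply the inverse transform."""
    n = len(signal)
    # naive forward DFT
    spec = [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]
    out_spec = []
    for k, c in enumerate(spec):
        freq = min(k, n - k)  # symmetric frequency index
        if freq < cutoff:
            out_spec.append(0j)  # suppress low frequencies
        else:
            # independently rescale real and imaginary parts
            out_spec.append(complex(real_gain * c.real, imag_gain * c.imag))
    # inverse DFT, keep the real part
    return [sum(out_spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]
```

A constant (purely low-frequency) signal is wiped out, while a rapidly alternating signal comes back amplified, which is the intended bias toward high-frequency forgery cues.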
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2954