Abstract: The abuse of AI-generated content (AIGC) on social networks, especially facial forgeries known as deepfakes, has raised severe security concerns, and such forgeries may involve manipulation of both visual and audio signals. For multimodal deepfake detection, previous methods usually exploit forgery-relevant knowledge to fully finetune Vision Transformers (ViTs) and perform cross-modal interaction to expose audio-visual inconsistencies. However, these approaches may undermine the prior knowledge of pretrained ViTs and ignore the domain gap between different modalities, resulting in unsatisfactory performance. To tackle these challenges, in this paper, we propose a new framework, i.e., Forgery-aware Audio-distilled Multimodal Learning (FRADE), for deepfake detection. In FRADE, the parameters of the pretrained ViT are frozen to preserve its prior knowledge, while two well-devised learnable components, i.e., Adaptive Forgery-aware Injection (AFI) and Audio-distilled Cross-modal Interaction (ACI), are leveraged to adapt forgery-relevant knowledge. Specifically, AFI captures high-frequency discriminative features on both audio and visual signals and injects them into the ViT via the self-attention layer. Meanwhile, ACI employs a set of latent tokens to distill audio information, which bridges the domain gap between the audio and visual modalities. ACI then learns the inherent audio-visual relationships through cross-modal interaction. Extensive experiments demonstrate that the proposed framework outperforms other state-of-the-art multimodal deepfake detection methods under various circumstances.
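To make the architecture described above concrete, below is a minimal PyTorch sketch, not the authors' code: it assumes visual and audio inputs have already been tokenized into (batch, tokens, dim) sequences, wraps a frozen transformer block with hypothetical AFI and ACI adapters, and uses illustrative module names (`AFI`, `ACI`, `FradeBlock`) and simple stand-ins (mean-subtraction for the high-frequency component, cross-attention over a few latent tokens for audio distillation).

```python
# Minimal sketch (assumptions noted above), not the official FRADE implementation.
import torch
import torch.nn as nn


class AFI(nn.Module):
    """Adaptive Forgery-aware Injection (sketch): extracts a high-frequency
    residual from the token sequence and injects it back via a learnable gate."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Parameter(torch.zeros(1))  # start with no injection

    def forward(self, tokens):
        # High-frequency component approximated as deviation from the token mean.
        low_freq = tokens.mean(dim=1, keepdim=True)
        high_freq = tokens - low_freq
        return tokens + self.gate * self.proj(high_freq)


class ACI(nn.Module):
    """Audio-distilled Cross-modal Interaction (sketch): a small set of latent
    tokens attends to audio tokens, then the distilled latents condition the
    visual tokens via cross-attention."""

    def __init__(self, dim, num_latents=8, num_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(1, num_latents, dim) * 0.02)
        self.distill = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.interact = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens, audio_tokens):
        latents = self.latents.expand(audio_tokens.size(0), -1, -1)
        distilled, _ = self.distill(latents, audio_tokens, audio_tokens)
        fused, _ = self.interact(visual_tokens, distilled, distilled)
        return visual_tokens + fused


class FradeBlock(nn.Module):
    """One ViT encoder block kept frozen, wrapped with learnable AFI/ACI adapters."""

    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.vit_block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        for p in self.vit_block.parameters():  # preserve pretrained prior knowledge
            p.requires_grad = False
        self.afi_visual = AFI(dim)
        self.afi_audio = AFI(dim)
        self.aci = ACI(dim)

    def forward(self, visual_tokens, audio_tokens):
        visual_tokens = self.vit_block(self.afi_visual(visual_tokens))
        audio_tokens = self.vit_block(self.afi_audio(audio_tokens))
        return self.aci(visual_tokens, audio_tokens), audio_tokens
```

Only the AFI and ACI parameters are trainable here; the frozen `vit_block` is a stand-in for a pretrained ViT layer.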
Primary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: Existing methods make great efforts to detect unimodal (audio or visual) deepfakes, but they overlook that real-world deepfake videos contain both audio and visual information, and both modalities can be forged. Therefore, it is urgent for the research community to develop an effective multimodal deepfake detection method that takes both audio and visual forgeries into forensic consideration. This manuscript, titled 'FRADE: Forgery-aware Audio-distilled Multimodal Learning for DeepFake Detection', proposes a novel framework for detecting audio-visual deepfakes. Specifically, in FRADE, audio and visual inputs are fed into stacked ViT blocks, each of which is equipped with two well-devised learnable modules, i.e., the Adaptive Forgery-aware Injection (AFI) module and the Audio-distilled Cross-modal Interaction (ACI) module, to facilitate pretrained ViTs in learning the intrinsic audio-visual relationships and thus capture discriminative audio-visual inconsistencies for deepfake detection. Compared with previous multimodal methods, our method has the following advancements: (1) To preserve the prior knowledge of pretrained ViTs and obtain better generalizability, we keep their parameters frozen and design two learnable modules to introduce forgery-relevant knowledge. (2) By stacking ACI-equipped ViT blocks, FRADE effectively bridges the domain gap between the audio and visual modalities in a progressive way, improving the quality of cross-modal interaction; a stacking sketch follows below.
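Reusing the hypothetical `FradeBlock` sketch above, the snippet below illustrates how several such blocks could be stacked into a detector with a binary real/fake head; the depth, token counts, and pooling are illustrative assumptions, not details from the manuscript.

```python
# Stacking sketch built on the FradeBlock defined in the previous snippet.
import torch
import torch.nn as nn


class FradeDetector(nn.Module):
    def __init__(self, dim=768, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(FradeBlock(dim) for _ in range(depth))
        self.head = nn.Linear(2 * dim, 2)  # real vs. fake logits

    def forward(self, visual_tokens, audio_tokens):
        # Each block refines both streams, progressively narrowing the
        # audio-visual domain gap before the next round of cross-modal fusion.
        for block in self.blocks:
            visual_tokens, audio_tokens = block(visual_tokens, audio_tokens)
        pooled = torch.cat([visual_tokens.mean(1), audio_tokens.mean(1)], dim=-1)
        return self.head(pooled)


# Example forward pass on dummy tokenized inputs.
if __name__ == "__main__":
    v = torch.randn(2, 196, 768)   # e.g., 14x14 visual patch tokens
    a = torch.randn(2, 64, 768)    # e.g., spectrogram patch tokens
    logits = FradeDetector()(v, a)
    print(logits.shape)            # torch.Size([2, 2])
```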
Submission Number: 5494