AVT$^{2}$-DWF: Improving Deepfake Detection With Audio-Visual Fusion and Dynamic Weighting Strategies

Rui Wang, Dengpan Ye, Long Tang, Yunming Zhang, Jiacheng Deng

Published: 01 Jan 2024, Last Modified: 06 Feb 2025. IEEE Signal Process. Lett. 2024. License: CC BY-SA 4.0

Abstract: With the continuous improvement of deepfake methods, forged content has shifted from single-modality to multi-modal fusion, posing new challenges for existing forgery detection algorithms. In this letter, we propose AVT$^{2}$-DWF, the Audio-Visual dual Transformers grounded in Dynamic Weight Fusion, which aims to amplify both intra- and cross-modal forgery cues, thereby enhancing detection capabilities. AVT$^{2}$-DWF adopts a dual-stage approach to capture both the spatial characteristics and the temporal dynamics of facial expressions. This is achieved through a face transformer encoder with an $n$-frame-wise tokenization strategy and an audio transformer encoder. Subsequently, it uses multi-modal conversion with dynamic weight fusion to address the challenge of fusing heterogeneous information from the audio and visual modalities. Experiments on the DeepfakeTIMIT, FakeAVCeleb, and DFDC datasets indicate that AVT$^{2}$-DWF achieves state-of-the-art performance in both intra- and cross-dataset deepfake detection.
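
The abstract does not specify how the dynamic weights are computed, so the following is only a minimal sketch of the general idea of input-dependent weighting of two modality features, not the authors' implementation. The function name `dynamic_weight_fusion`, the dot-product gating vectors, and the softmax normalization are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_weight_fusion(audio_feat, visual_feat, gate_audio, gate_visual):
    """Fuse two equal-length feature vectors with input-dependent weights.

    Hypothetical gating (an assumption, not taken from the paper): each
    modality gets a scalar score, the dot product of its feature vector
    with a gating vector, and a softmax turns the two scores into
    weights that sum to 1.
    """
    if len(audio_feat) != len(visual_feat):
        raise ValueError("audio and visual features must have the same length")
    score_a = sum(g * x for g, x in zip(gate_audio, audio_feat))
    score_v = sum(g * x for g, x in zip(gate_visual, visual_feat))
    w_a, w_v = softmax([score_a, score_v])
    # Weighted sum of the two modality embeddings.
    fused = [w_a * a + w_v * v for a, v in zip(audio_feat, visual_feat)]
    return fused, (w_a, w_v)


if __name__ == "__main__":
    fused, weights = dynamic_weight_fusion(
        [1.0, 3.0], [3.0, 1.0], [0.0, 0.0], [0.0, 0.0]
    )
    print(fused, weights)
```

With all-zero gating vectors both scores are equal, so the two modalities are weighted equally and the fused vector is their plain average. In a trained model the gating parameters would be learned, so the weights would vary from sample to sample.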