X3A: Efficient Multimodal Deepfake Detection with Score-Level Fusion

Chan Park, Bohyun Moon, Minsun Jeon, Jee-weon Jung, Simon S. Woo

Published: 31 Mar 2025, Last Modified: 28 Feb 2026, Crossref, CC BY-SA 4.0
Abstract: Advances in deepfake generation have highlighted the need for sophisticated detection methods and realistic datasets to ensure that models generalize effectively. While traditional datasets focused on unimodal manipulations, the emergence of multimodal datasets, which include audio-visual forgeries, has increased the complexity of deepfake detection. The recently released LAV-DF and AV-Deepfake1M datasets feature partial manipulations in multimodal content and underscore the need for effective video-level detection methods to identify these forgeries. In this work, we propose X3A, an efficient multimodal video deepfake detection model that combines two powerful unimodal models through probabilistic score-level fusion. X3A operates directly on raw visual and audio inputs without relying on hand-crafted features. We conducted extensive experiments on multiple multimodal deepfake benchmark datasets and achieved superior performance on multimodal deepfake detection, successfully detecting both entirely and partially manipulated videos. Our X3A model achieves an accuracy of 0.9960 and an AUC of 0.9999 on the most challenging AV-Deepfake1M benchmark, surpassing all existing models.
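The abstract does not state which fusion rule X3A uses, so the following is only a minimal sketch of what probabilistic score-level fusion can look like in general. It assumes each unimodal model outputs a probability that its modality is fake and combines the two with a noisy-OR rule (the video counts as fake if at least one modality is fake). The function names, the noisy-OR rule, and the 0.5 threshold are illustrative assumptions, not details taken from the paper.

```python
def fuse_scores(p_visual: float, p_audio: float) -> float:
    """Noisy-OR fusion (an assumed rule, not necessarily X3A's).

    Treats the two unimodal scores as independent probabilities of
    manipulation and returns the probability that at least one
    modality is fake.
    """
    for p in (p_visual, p_audio):
        if not 0.0 <= p <= 1.0:
            raise ValueError("scores must be probabilities in [0, 1]")
    return 1.0 - (1.0 - p_visual) * (1.0 - p_audio)


def classify(p_visual: float, p_audio: float, threshold: float = 0.5) -> bool:
    """Flag the video as fake when the fused score reaches the threshold.

    The threshold value is a hypothetical default for illustration.
    """
    return fuse_scores(p_visual, p_audio) >= threshold
```

Under this rule, a video whose audio track alone is confidently flagged still receives a high fused score, which matches the partial-manipulation setting the abstract describes, where only one modality may be forged.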