WaveConViT: Wavelet-Based Convolutional Vision Transformer for Cross-Manipulation Deepfake Video Detection
Abstract: The ease of use and wide availability of high-quality deepfake creation tools raise significant concerns about the reliability and trustworthiness of online content and complicate the task of detecting facial tampering. The development of effective deepfake detection methods is therefore of utmost importance. In recent years, facial deepfake detection has advanced considerably thanks to deep learning-based methods and the availability of large datasets of high-quality deepfake videos. Although these methods achieve excellent results when detecting deepfakes generated with techniques seen during training, the cross-manipulation (generalization) task, in which a trained model is exposed to unseen manipulation techniques, remains a major challenge that is attracting the attention of the research community. In this paper, we introduce WaveConViT, a novel spatio-temporal architecture for deepfake detection based on Vision Transformers and the two-dimensional discrete wavelet transform. We also introduce and evaluate a temporal sampling strategy based on frame skipping. We extensively benchmark this architecture in the challenging cross-manipulation scenario on the FaceForensics++, Celeb-DF, and DeeperForensics-1.0 datasets, comparing it to a selection of modern, representative Vision Transformer (ViT) and convolutional neural network (CNN) architectures, and demonstrating the value of high-frequency features, as well as of our frame-skipping strategy, for deepfake detection.
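The two ingredients named in the abstract, a two-dimensional discrete wavelet transform over each frame and frame-skipping temporal sampling, can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the paper's implementation: it uses a single-level Haar wavelet and a uniform stride, whereas the paper does not specify here which wavelet family, decomposition depth, stride, or frame count WaveConViT actually uses.

```python
import numpy as np

def haar_dwt2(frame):
    """Single-level 2D Haar DWT of a 2D frame with even dimensions.

    Combines each 2x2 pixel block into four subbands: LL (low-frequency
    approximation) and LH/HL/HH (high-frequency details), which are the
    kind of high-frequency features the abstract refers to.
    """
    a = frame[0::2, 0::2]
    b = frame[0::2, 1::2]
    c = frame[1::2, 0::2]
    d = frame[1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # approximation
    lh = (a - b + c - d) / 2.0  # horizontal detail
    hl = (a + b - c - d) / 2.0  # vertical detail
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def sample_frames(video, skip=4, num_frames=8):
    """Frame-skipping temporal sampling (hypothetical parameters).

    Takes every `skip`-th frame from the start of the clip, up to
    `num_frames` frames, instead of a contiguous run of frames.
    """
    idx = np.arange(0, skip * num_frames, skip)
    idx = idx[idx < len(video)]
    return video[idx]
```

For a constant frame, all detail subbands are zero, so deviations in LH/HL/HH isolate edges and textures where manipulation artifacts tend to concentrate.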
External IDs: dblp:conf/icpr/AtamnaTM24