Improved Deepfake Video Detection Using Convolutional Vision Transformer

Deressa Wodajo, Peter Lambert, Glenn Van Wallendael, Solomon Atnafu, Hannes Mareen

Published: 2024, Last Modified: 06 May 2026, Venue: GEM 2024, License: CC BY-SA 4.0

Abstract: Deepfakes are hyper-realistic videos in which faces are replaced, swapped, or forged using deep-learning models. These potent media manipulation techniques hold promise for applications across various domains. Yet they also present a significant risk when employed for malicious purposes such as identity fraud, phishing, spreading false information, and executing scams. In this work, we propose a novel and improved Deepfake video detector that uses a Convolutional Vision Transformer (CViT2), which builds on the concepts of our previous work (CViT). The CViT architecture consists of two components: a Convolutional Neural Network that extracts learnable features, and a Vision Transformer that categorizes these learned features using an attention mechanism. We trained and evaluated our model on five datasets, namely the Deepfake Detection Challenge Dataset (DFDC), FaceForensics++ (FF++), Celeb-DF v2, DeepfakeTIMIT, and TrustedMedia. On test sets unseen during training, we achieved accuracies of 95%, 94.8%, 98.3%, and 76.7% on the DFDC, FF++, Celeb-DF v2, and DeepfakeTIMIT datasets, respectively. In conclusion, our proposed Deepfake detector can be used in the battle against misinformation and in other forensic use cases.
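The two-component design described in the abstract (convolutional feature extraction followed by attention-based classification) can be illustrated with a small, dependency-free sketch. This is not the authors' CViT2 implementation: the function names, the 8x8 input, the single 3x3 kernel, the use of feature-map rows as tokens, and the random untrained weights are all assumptions made for illustration, and a real Vision Transformer additionally uses patch embeddings, positional encodings, multiple heads, and stacked layers.

```python
import math
import random


def conv2d_valid(image, kernel):
    """Single-channel 2D cross-correlation ('valid' padding) followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[max(0.0, sum(image[i + a][j + b] * kernel[a][b]
                          for a in range(kh) for b in range(kw)))
             for j in range(out_w)]
            for i in range(out_h)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over token vectors."""
    q = [matvec(wq, t) for t in tokens]
    k = [matvec(wk, t) for t in tokens]
    v = [matvec(wv, t) for t in tokens]
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights standing in for trained ones."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def classify_face(image, params):
    """Toy two-stage pipeline: conv features -> attention -> class probabilities."""
    # Stage 1: convolutional feature extraction (assumed: one 3x3 kernel).
    feature_map = conv2d_valid(image, params["kernel"])
    # Assumption: each row of the feature map is treated as one token.
    tokens = feature_map
    # Stage 2: attention over the tokens, mean pooling, then a linear head.
    attended = self_attention(tokens, params["wq"], params["wk"], params["wv"])
    pooled = [sum(col) / len(attended) for col in zip(*attended)]
    logits = matvec(params["head"], pooled)
    return softmax(logits)  # two scores, interpreted here as [real, fake]


rng = random.Random(0)
params = {
    "kernel": random_matrix(3, 3, rng),  # 8x8 input -> 6x6 feature map
    "wq": random_matrix(4, 6, rng),      # project 6-dim tokens to 4 dims
    "wk": random_matrix(4, 6, rng),
    "wv": random_matrix(4, 6, rng),
    "head": random_matrix(2, 4, rng),    # two output classes
}
image = [[rng.random() for _ in range(8)] for _ in range(8)]
probs = classify_face(image, params)
```

Because the weights are random, `probs` carries no meaning as a prediction; the sketch only shows how a face crop would flow through a convolutional stage and then an attention stage before a two-class decision.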

External IDs: dblp:conf/gamesem/WodajoLWAM24