Keywords: deepfakes, video forgery detection, high-frequency, texture, optical flow, EfficientNet, Swin Transformer
Abstract: Rapid advances in video processing technology make it easy to forge videos without leaving visible artifacts. The spread of forged videos can have moral and legal consequences and poses a potential threat to personal safety and social stability, so detecting deepfake videos is critical. Although previous detection methods achieve high accuracy, they generalize poorly to unseen data in real-world scenarios, for three fundamental reasons: it is difficult to capture general artifact clues; it is challenging to select an appropriate model for extracting specific features; and it is hard to exploit the extracted features fully and effectively. We observe that high-frequency information in the image and texture features in the shallow layers of a model expose subtle artifacts. Moreover, the optical flow of a real video varies over time, whereas the optical flow of a deepfake video shows little variation, and consecutive frames of a real video exhibit temporal consistency. In this paper, we propose a dual-branch video forgery detection model named ENST, which integrates EfficientNet-B5 and Swin Transformer in a parallel and interactive manner. Specifically, EfficientNet-B5 extracts high-frequency and shallow-layer texture artifact features, while Swin Transformer captures subtle discrepancies between optical flows. To extract more robust face features, we design a new loss function for EfficientNet-B5 and additionally introduce an attention mechanism into EfficientNet-B5 to enhance the extracted features. Experiments on the FaceForensics++ and Celeb-DF (v2) datasets show that ENST achieves higher accuracy and better generalization than state-of-the-art methods.
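The dual-branch design described above can be outlined in code. The following is a minimal, framework-free sketch under stated assumptions: the branch stubs, the fusion-by-concatenation step, and the threshold classifier are all placeholders standing in for EfficientNet-B5, Swin Transformer, and the paper's actual fusion head, whose details the abstract does not specify.

```python
# Hedged sketch of an ENST-style dual-branch forward pass.
# All function bodies are illustrative stubs, not the paper's model.

def spatial_branch(frame_features):
    # Placeholder for EfficientNet-B5: would extract high-frequency and
    # shallow-layer texture artifact features from a face frame.
    return [float(x) for x in frame_features]

def temporal_branch(flow_features):
    # Placeholder for Swin Transformer: would encode subtle
    # discrepancies between consecutive optical-flow fields.
    return [float(x) for x in flow_features]

def enst_forward(frame_features, flow_features):
    # The two branches run in parallel; their outputs are fused here by
    # simple concatenation (an assumption) and scored by a stub head.
    fused = spatial_branch(frame_features) + temporal_branch(flow_features)
    score = sum(fused) / len(fused)   # stand-in for a learned classifier
    return 1 if score > 0.5 else 0    # 1 = fake, 0 = real

print(enst_forward([0.9, 0.8], [0.7, 0.6]))  # → 1 (flagged as fake)
```

In the actual model the fusion would be interactive (cross-branch feature exchange) rather than a single concatenation, and the classifier head would be trained with the paper's proposed loss.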