Abstract: Deepfake detection has attracted widespread attention in the computer vision field. Existing efforts have achieved notable progress, but significant issues remain unresolved. Coarse-grained local and global features are insufficient to capture subtle forgery traces across varied inputs, and detection efficiency remains inadequate for practical applications. In this paper, we propose a robust and efficient transformer-based deepfake detection (TransDFD) network, which learns more discriminative and general manipulation patterns in an end-to-end manner. Specifically, a robust transformer module is designed to learn fine-grained local and global features based on intra-patch locally-enhanced relations as well as inter-patch locally-enhanced global relationships in face images. A novel plug-and-play spatial attention scaling (SAS) module is proposed to emphasize salient features while suppressing less important representations; it can be integrated into any transformer-based model without increasing computational complexity. Extensive experiments on several public benchmarks demonstrate that the proposed TransDFD model outperforms state-of-the-art methods in terms of robustness and computational efficiency.
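The abstract does not include an implementation, so purely as an illustration, below is a minimal PyTorch sketch of what a plug-and-play spatial attention scaling module could look like under one plausible reading: a learnable per-position scaling map applied to patch tokens. The class name `SpatialAttentionScaling` and this design are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn


class SpatialAttentionScaling(nn.Module):
    """Hypothetical sketch of a plug-and-play spatial attention scaling (SAS)
    module: rescales patch tokens with a learned per-position saliency map,
    emphasizing salient positions and suppressing less important ones.
    This is an assumed design for illustration, not the paper's method.
    """

    def __init__(self, num_patches: int):
        super().__init__()
        # One learnable scale per spatial position (patch), initialized to 1
        # so the module starts as an identity mapping.
        self.scale = nn.Parameter(torch.ones(1, num_patches, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, embed_dim) patch tokens from a transformer block
        return x * self.scale


# Usage: insert after any transformer block that outputs patch tokens.
tokens = torch.randn(8, 196, 768)              # e.g., ViT-B/16-sized tokens
sas = SpatialAttentionScaling(num_patches=196)
out = sas(tokens)                              # same shape: (8, 196, 768)
```

A purely element-wise scaling like this adds only O(N) parameters and multiplications per layer, which is consistent with the abstract's claim that SAS can be integrated without increasing computational complexity.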