Abstract: Face forgery detection has become a pressing problem because of the adverse effects of face forgery techniques on social media. State-of-the-art deep-learning-based methods commonly rely on low-level texture features, since most face forgery methods have difficulty simulating the low-level signals of natural images. However, most existing methods examine these low-level features from only the spatial or only the temporal perspective. In this work, we revisit face forgery detection from a spatio-temporal perspective that covers both, for better generalization performance. Specifically, we propose a Spatio-Temporal Difference Network (STDN) to mine low-level clues for face forgery detection. The network contains three different but complementary branches: 1) high-frequency channel difference images, 2) inter-frame residual signals, and 3) raw RGB images. It captures face forgery traces through a three-branch collaborative learning framework. Furthermore, we propose a multimodal attention fusion module to effectively fuse the complementary features from the different branches. Through comprehensive experiments on several publicly available datasets, we demonstrate the superior performance of the proposed STDN. The effectiveness of low-level spatio-temporal clues in a collaborative learning framework could guide future work in face forgery detection.
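As a rough illustration of the three branch inputs named in the abstract, the sketch below builds a raw RGB frame, a channel difference image, and an inter-frame residual for each pair of consecutive frames. The function names and the exact formulas are assumptions made for illustration only: the abstract does not define how STDN computes these signals, and the high-frequency filtering step is omitted here.

```python
# Hypothetical sketch: names and formulas are illustrative assumptions,
# not the definitions used by STDN. Frames are H x W x 3 nested lists (R, G, B).

def channel_difference(frame):
    """Per-pixel differences between colour channels: (R-G, G-B, B-R).

    Assumed stand-in for the 'channel difference image'; no high-pass
    filtering is applied in this sketch.
    """
    return [[[r - g, g - b, b - r] for (r, g, b) in row] for row in frame]


def inter_frame_residual(prev_frame, next_frame):
    """Per-pixel, per-channel difference between two consecutive frames."""
    return [
        [[n - p for p, n in zip(prev_px, next_px)]
         for prev_px, next_px in zip(prev_row, next_row)]
        for prev_row, next_row in zip(prev_frame, next_frame)
    ]


def three_branch_inputs(frames):
    """Build one dict of branch inputs per pair of consecutive frames."""
    inputs = []
    for prev_frame, next_frame in zip(frames, frames[1:]):
        inputs.append({
            "rgb": next_frame,                                    # raw RGB branch
            "channel_diff": channel_difference(next_frame),       # spatial low-level clue
            "residual": inter_frame_residual(prev_frame, next_frame),  # temporal clue
        })
    return inputs


# Usage: two 1x1 frames, just to show the shapes involved.
frame0 = [[[10, 20, 30]]]
frame1 = [[[12, 25, 31]]]
branches = three_branch_inputs([frame0, frame1])
```

In a real pipeline these three tensors would feed separate feature extractors whose outputs are then fused; the fusion module itself is not sketched here because the abstract gives no detail about its structure.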