Spatiotemporal Predictive Representation Learning for Deepfake Video Detection

Anonymous

25 Feb 2022 (modified: 05 May 2023) · ACMMM 2022 Track Reproducibility Blind Submission · Readers: Everyone
Keywords: Deepfake video detection, predictive representation learning, self-supervised learning, masked modeling
Abstract: Deepfake video detection is very challenging due to the hyper-realistic synthesis of frame data. The key is learning discriminative representations to capture fine-grained spatiotemporal cues. Inspired by that, this paper proposes a deepfake video detector learning approach with spatiotemporal predictive representation learning. The detector cascades a short-term transformer-based encoder, a long-term ConvGRU-based aggregator and a SVM-based binary classifier. The encoder and aggregator jointly serve as the backbone and are pretrained in a self-supervised manner. Specially, the encoder is pretrained with a 3D masked autoencoder (3D-MAE) to extract general short-term semantic features, while the aggregator is pretrained with a memory-augmented feature predictor to turn short-term semantic features into long-term context features. In this way, the extracted features can simultaneously describe the latent patterns of videos across frames spatially and temporally in a unified way. Finally, the classifier is trained to classify the context features along with the fine-tuning of the encoder and aggregator, distinguishing fake videos from real ones. Extensive experiments on popular benchmarks clearly demonstrate that the proposed approach significantly outperforms 25 state-of-the-arts, with good generalization ability across datasets.