Video forgery detection based on multilocal feature and spatiotemporal fusion

Published: 01 Jan 2025 · Last Modified: 25 Jul 2025 · J. Electronic Imaging 2025 · CC BY-SA 4.0
Abstract: Existing video forgery detection methods identify manipulated faces by focusing either on the entire facial region within a single frame or on inconsistencies between frames. However, these methods neglect the correlation between spatial and temporal features, leading to overfitting to specific regions and poor generalization across scenarios. To alleviate these issues, we propose a video forgery detection scheme based on multilocal features and spatiotemporal fusion, called MLS, which captures forgery clues in both the spatial and temporal domains. Specifically, on the one hand, we design a multilocal feature block (MLFB) that extracts local spatial features from three nonoverlapping regions, alleviating the risk of overfitting to specific areas. In addition, by minimizing the mutual information between features extracted from different local regions, the model learns independent forgery clues, enhancing its generalization ability. On the other hand, we propose a spatial-temporal feature block (SPFB) that extracts temporal features between video frames to supplement the spatial forgery clues. By integrating multilocal and spatial-temporal features, MLS is more adaptable and robust in complex scenarios. Extensive experimental results demonstrate that MLS outperforms state-of-the-art deepfake video detection methods in detection accuracy and robustness, in both in-dataset and cross-dataset evaluations on various deepfake videos.
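The multilocal idea described in the abstract — extracting features from nonoverlapping facial regions and discouraging statistical dependence between them — can be sketched roughly as follows. This is a minimal illustration, not the paper's MLFB: the horizontal-band region split, the histogram feature extractor, and the squared-correlation surrogate for mutual information are all assumptions made for the sake of the example.

```python
import numpy as np

def split_regions(face, n=3):
    # Assumed region layout: n nonoverlapping horizontal bands of the face
    # crop (e.g., eyes / nose / mouth); the paper's actual split may differ.
    step = face.shape[0] // n
    return [face[i * step:(i + 1) * step] for i in range(n)]

def region_features(region, dim=8):
    # Stand-in feature extractor: a per-region intensity histogram in place
    # of the learned local features described in the abstract.
    hist, _ = np.histogram(region, bins=dim, range=(0.0, 1.0), density=True)
    return hist

def dependence_penalty(feats):
    # Illustrative surrogate for mutual-information minimization: sum of
    # squared Pearson correlations between features of different regions.
    penalty = 0.0
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            c = np.corrcoef(feats[i], feats[j])[0, 1]
            penalty += float(c ** 2)
    return penalty

rng = np.random.default_rng(0)
face = rng.random((96, 96))  # dummy normalized face crop
feats = [region_features(r) for r in split_regions(face)]
loss = dependence_penalty(feats)  # would be minimized alongside a detection loss
print(len(feats), loss >= 0.0)
```

In a trained detector this penalty would be one term of the training objective, pushing the three region branches toward independent forgery clues as the abstract describes.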