Abstract: Recently, deep-learning-based models have been widely used for deepfake video detection due to their effectiveness in artifact extraction. Most existing deep-learning detection methods with attention mechanisms attach more importance to information in the spatial domain. However, the discrepancies between frames are also informative, and different temporal regions deserve different levels of attention. To address this problem, this paper proposes an Attention Guided LSTM Network (AGLNet), which takes into consideration the mutual correlations in both temporal and spatial domains to effectively capture the artifacts in deepfake videos. In particular, sequential feature maps extracted from convolutional and fully-connected layers of a convolutional neural network are respectively fed into the attention-guided LSTM module to learn soft spatio-temporal assignment weights, which aggregate not only detailed spatial information but also temporal information from consecutive video frames. Experiments on the FaceForensics++ and Celeb-DF datasets demonstrate the superiority of the proposed AGLNet in extracting spatio-temporal artifacts.
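To make the temporal side of the architecture concrete, below is a minimal PyTorch sketch of an attention-guided LSTM head of the kind the abstract describes: per-frame CNN features are fed into an LSTM, a learned scorer produces soft assignment weights over time steps, and the attention-weighted summary is classified as real or fake. The class name `AttentionGuidedLSTM`, the feature and hidden dimensions, and the single-linear-layer attention scorer are all assumptions for illustration; the abstract does not specify the paper's exact formulation, and the spatial-attention branch over convolutional feature maps is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGuidedLSTM(nn.Module):
    """Hypothetical sketch: per-frame CNN features -> LSTM ->
    soft temporal attention -> weighted aggregation -> real/fake logit."""
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)       # scores each time step
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, feats):
        # feats: (batch, T, feat_dim) -- per-frame CNN features
        h, _ = self.lstm(feats)                    # (batch, T, hidden_dim)
        scores = self.attn(h).squeeze(-1)          # (batch, T)
        alpha = F.softmax(scores, dim=1)           # soft assignment weights
        pooled = (alpha.unsqueeze(-1) * h).sum(1)  # attention-weighted summary
        return self.classifier(pooled)             # (batch, 1) logit

# Usage: 16-frame clips with 2048-d features (e.g., from a ResNet backbone)
model = AttentionGuidedLSTM()
clip_feats = torch.randn(4, 16, 2048)
logits = model(clip_feats)                         # shape (4, 1)
```

The softmax over the time axis is what lets the network attend unevenly across frames, so frames whose inter-frame discrepancies carry stronger manipulation artifacts contribute more to the aggregated representation.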