Abstract: The emergence of hyper-realistic deepfake videos has raised
significant concerns regarding their potential misuse. However, prior
research on deepfake detection has primarily focused on image-based
approaches, with little emphasis on video. As generation techniques
advance, enabling intricate and dynamic manipulation of entire faces
as well as specific facial components in a video sequence, capturing
dynamic changes in both global and local facial features becomes
crucial for detecting deepfake videos. This
paper proposes a novel sequential attentive face embedding, SAFE,
that captures facial dynamics in a deepfake video. SAFE uses
contrastive learning to integrate the global and local dynamics of
facial features revealed in a video sequence.
Through a comprehensive comparison with state-of-the-art
methods on the DFDC (Deepfake Detection Challenge) dataset and
the FaceForensics++ benchmark, we show that our model achieves
the highest accuracy in detecting deepfake videos on both datasets.