Abstract: Highlights
• RGB and motion inputs provide distinctive spatial and temporal information.
• Detaching ID-related components from the original embeddings can improve the generalization capability of the Deepfake detector.
• The Swin Transformer is powerful at modeling spatiotemporal embeddings for classification.
• An effectively implemented face-cropping strategy minimizes the influence of background elements.