Beyond Audio-Visual Alignment: Unmasking Talking Head Deepfakes via Red Hue Discrepancies in HSV Color Space
Keywords: audio-visual deepfake, fake detection, talking head generation, unsupervised learning
TL;DR: We propose an unsupervised red hue-based detector for identifying talking head deepfakes across diverse generation paradigms, supported by a novel radiance field-based dataset for evaluation.
Abstract: Deepfake video detection is crucial for preventing the spread of harmful forged audio-visual content. However, current audio-visual forgery datasets lack radiance field-based videos, which impedes comprehensive evaluation of detection models. To fill this gap, we introduce the Radiance Field Audio-Visual (RFAV) dataset, comprising fake videos synthesized with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). On the detection side, existing methods focus primarily on audio-visual mismatches and are of limited effectiveness on forged videos with highly synchronized lip movements. To address this challenge, we rethink talking head deepfakes from a novel perspective: the distribution of red hue in the HSV color space. We find that real and forged videos differ distinctly in the HSV color space, particularly in regions of intense facial motion. Based on this observation, we propose the Red Hue-based Talking Head Forgery Detection (RHTHFD) model, an unsupervised learning framework that uses visual region attention to adaptively fuse HSV and visual features and integrates re-weighted speech features to improve the generalization of deepfake detection. Our method achieves state-of-the-art performance on multiple evaluation benchmarks, including the proposed RFAV dataset.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 7116
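The abstract's central observation concerns the distribution of red hue in the HSV color space. As a rough illustration of what such a statistic could look like, the sketch below measures the fraction of pixels in a face region whose hue falls near red. It is not the RHTHFD model: the function name `red_hue_fraction` and the thresholds `hue_tol`, `min_sat`, and `min_val` are hypothetical choices made for this example, and the abstract does not specify how the red hue distribution is actually computed.

```python
import colorsys


def red_hue_fraction(pixels, hue_tol=1 / 12, min_sat=0.2, min_val=0.2):
    """Return the fraction of RGB pixels whose HSV hue is close to red.

    pixels  : iterable of (r, g, b) tuples with 8-bit channels (0..255)
    hue_tol : assumed tolerance around red, as a fraction of the hue circle
              (1/12 corresponds to 30 degrees); not taken from the paper
    min_sat, min_val : assumed floors that skip grey or very dark pixels,
              whose hue is not meaningful
    """
    pixels = list(pixels)
    if not pixels:
        return 0.0

    red_count = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Red sits at hue 0 and the hue circle wraps at 1, so measure the
        # circular distance to 0 rather than the raw hue value.
        distance_to_red = min(h, 1.0 - h)
        if distance_to_red <= hue_tol and s >= min_sat and v >= min_val:
            red_count += 1
    return red_count / len(pixels)


if __name__ == "__main__":
    # Toy "mouth region": two reddish pixels, one green, one blue.
    region = [(255, 0, 0), (200, 30, 40), (0, 255, 0), (0, 0, 255)]
    print(red_hue_fraction(region))
```

In practice one might compute this statistic per frame over a cropped high-motion region, such as the mouth, and compare its distribution between videos; how the paper's model turns such a signal into features is beyond what the abstract states.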