Abstract: Temporal forgery localization (TFL), which identifies subtle temporal manipulations within video content, is crucial in deepfake detection. However, current TFL methods generalize poorly, especially across languages, which restricts their performance in diverse environments. This limitation stems from two key factors: first, most existing datasets are English-centric; second, multi-modal information is learned inadequately, with visual features often prioritized over audio features. To address this gap, we created the Chinese audio-visual deepfake (CHAV-DF) dataset, the first dataset designed for TFL in the Chinese context, which provides a valuable benchmark for evaluating TFL methods in cross-lingual settings. Additionally, we introduced a cross-lingual transformer framework (CLFormer), which prioritizes audio features and utilizes a pre-trained multi-lingual Wav2Vec2 model to enhance cross-lingual generalization, while incorporating visual features to further refine TFL. Moreover, we incorporated a refinement module into CLFormer to improve the accuracy of forgery localization. Experiments on the LAV-DF, CHAV-DF, and AV-Deepfake1M datasets demonstrate that CLFormer performs well in both same-language and cross-language settings. Specifically, CLFormer achieves an average precision (AP) of 57.68% at a temporal intersection over union (tIoU) threshold of 0.50 when trained on CHAV-DF and tested on LAV-DF, surpassing the state-of-the-art method by 47.59% and validating its cross-language generalization capability.
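The abstract's audio branch builds on a pre-trained multi-lingual Wav2Vec2 encoder. Below is a minimal sketch of how frame-level audio features can be extracted with such a model via the Hugging Face `transformers` library; the checkpoint name (`facebook/wav2vec2-large-xlsr-53`, a multilingual Wav2Vec2 variant) and the pooling-free pipeline are our assumptions, not the paper's confirmed configuration.

```python
# Hedged sketch: multilingual Wav2Vec2 feature extraction (assumed checkpoint,
# not necessarily the exact model used by CLFormer).
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_ID = "facebook/wav2vec2-large-xlsr-53"  # multilingual checkpoint (assumption)

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID)
model = Wav2Vec2Model.from_pretrained(MODEL_ID).eval()

def extract_audio_features(waveform: torch.Tensor, sampling_rate: int = 16_000) -> torch.Tensor:
    """Return frame-level features (T x D) for a mono waveform tensor."""
    inputs = feature_extractor(
        waveform.numpy(), sampling_rate=sampling_rate, return_tensors="pt"
    )
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, T, 1024)
    return hidden.squeeze(0)

# One second of audio yields ~49 feature frames (roughly 20 ms per frame),
# which a downstream localization head can score for forged spans.
features = extract_audio_features(torch.zeros(16_000))
print(features.shape)  # torch.Size([49, 1024])
```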
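For readers unfamiliar with the reported AP@0.50 metric, the following sketch shows the standard temporal intersection-over-union (tIoU) between a predicted and a ground-truth forged segment, as used in temporal action/forgery localization benchmarks; variable names are ours for illustration.

```python
# Hedged sketch: 1-D temporal IoU underlying the AP@tIoU=0.50 metric.
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as a match at the 0.50 threshold when
# temporal_iou(pred, gt) >= 0.5; AP is then computed over ranked predictions.
print(temporal_iou((1.0, 3.0), (2.0, 4.0)))  # 0.333... -> not a match at 0.50
```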