AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Oral · CC BY 4.0
Abstract: The detection and localization of highly realistic deepfake audio-visual content are challenging even for the most advanced state-of-the-art methods. While most research efforts in this domain focus on detecting high-quality deepfake images and videos, only a few works address the localization of small segments of audio-visual manipulations embedded in real videos. In this research, we emulate the process of such content generation and propose the AV-Deepfake1M dataset. The dataset contains content-driven (i) video manipulations, (ii) audio manipulations, and (iii) audio-visual manipulations for more than 2K subjects, resulting in a total of more than 1M videos. The paper provides a thorough description of the proposed data generation pipeline, accompanied by a rigorous analysis of the quality of the generated data. A comprehensive benchmark of the proposed dataset using state-of-the-art deepfake detection and localization methods indicates a significant drop in performance compared to previous datasets. The proposed dataset will play a vital role in building the next generation of deepfake localization methods. The dataset and associated code will be made public.
Primary Subject Area: [Generation] Social Aspects of Generative AI
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference:
1. We propose AV-Deepfake1M, a large-scale content-driven audio-visual dataset for the task of temporal deepfake localization.
2. We propose an elaborate data generation pipeline employing novel manipulation strategies and incorporating the state of the art in text, video, and audio generation.
3. We perform comprehensive analysis and benchmarks of the proposed dataset utilizing state-of-the-art deepfake detection and localization methods.
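Temporal deepfake localization, as evaluated here, asks a model to output the start and end times of manipulated segments inside an otherwise real video; such benchmarks typically score predictions by temporal intersection-over-union (IoU) against ground-truth segments. The sketch below illustrates that scoring idea in minimal form; it is an assumption for illustration, not the paper's actual evaluation code, and all function names are hypothetical.

```python
def segment_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def count_true_positives(preds, gts, iou_thresh=0.5):
    """Greedy one-to-one matching of predicted fake segments to ground truth.

    Returns the number of predictions matching an unused ground-truth
    segment with IoU at or above the threshold (hypothetical helper).
    """
    matched, tp = set(), 0
    for p in preds:
        best, best_iou = None, iou_thresh
        for i, g in enumerate(gts):
            if i in matched:
                continue
            iou = segment_iou(p, g)
            if iou >= best_iou:
                best, best_iou = i, iou
        if best is not None:
            matched.add(best)
            tp += 1
    return tp
```

From counts like these, precision and recall at several IoU thresholds (and an average precision over them) are commonly reported for localization benchmarks of this kind.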
Supplementary Material: zip
Submission Number: 5