Collecting, Curating, and Annotating a Good-Quality Speech Deepfake Dataset for Famous Figures: Process and Challenges
Keywords: Text-to-Speech, Database, political figures
TL;DR: This paper presents a high-quality speech deepfake dataset for political figures using automated collection and synthesis methods, achieving strong naturalness (NISQA-TTS 3.69) and a 61.9% human misclassification rate.
Presentation Preference: Open to it if recommended by organizers
Abstract: Recent advances in speech synthesis have introduced unprecedented challenges in maintaining voice authenticity, particularly concerning public figures who are frequent targets of impersonation attacks. This paper presents a comprehensive methodology for collecting, curating, and generating synthetic speech data for political figures, along with a detailed analysis of the challenges encountered. We introduce a systematic approach that incorporates an automated pipeline for collecting high-quality bona fide speech samples, featuring transcription-based segmentation that significantly improves the quality of synthetic speech. We experimented with various synthesis approaches, from single-speaker to zero-shot synthesis, and documented the evolution of our methodology. The resulting dataset comprises bona fide and synthetic speech samples from ten public figures, demonstrating high quality with an NISQA-TTS naturalness score of 3.69 and a human misclassification rate of up to 61.9%.
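The abstract mentions transcription-based segmentation of long bona fide recordings into paired clip/transcript units. The sketch below illustrates one way such a step could look; it is an assumption, not the paper's actual pipeline. The tools (openai-whisper for transcription with segment timestamps, pydub for slicing), the file names, and the length bounds are all illustrative choices.

```python
# Minimal sketch of transcription-based segmentation, assuming Whisper for
# transcription and pydub for audio slicing; the paper's actual tooling is
# not specified in the abstract.
import os
import whisper
from pydub import AudioSegment

SOURCE = "interview.wav"        # hypothetical long-form bona fide recording
MIN_SEC, MAX_SEC = 3.0, 15.0    # assumed duration bounds for usable clips

os.makedirs("clips", exist_ok=True)
model = whisper.load_model("base")
result = model.transcribe(SOURCE)            # segments carry start/end times and text
audio = AudioSegment.from_wav(SOURCE)

kept = 0
for seg in result["segments"]:
    dur = seg["end"] - seg["start"]
    if not (MIN_SEC <= dur <= MAX_SEC):
        continue                              # skip clips too short or too long for TTS training
    clip = audio[int(seg["start"] * 1000):int(seg["end"] * 1000)]
    clip.export(f"clips/{kept:05d}.wav", format="wav")
    with open(f"clips/{kept:05d}.txt", "w") as f:
        f.write(seg["text"].strip())          # paired transcript for fine-tuning or zero-shot prompts
    kept += 1
```

The output is a directory of short, transcript-aligned clips, which is the usual input format for both single-speaker fine-tuning and zero-shot voice-cloning systems; the specific duration filter and storage layout here are assumptions.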
Submission Number: 22