Emergent Capability in Audio Deepfake Detection

Published: 2025, Last Modified: 29 Aug 2025, Venue: IWBF 2025, License: CC BY-SA 4.0
Abstract: Recent advances in generative AI have dramatically improved the quality of synthetic and cloned voices. With many generative AI tools now released as open source, high-quality voice generation targeting a particular person is widely accessible to the public, which is becoming a significant AI security issue. Although many existing works report good detection accuracy on a particular dataset, models trained on one dataset often generalize poorly to new datasets, sometimes with close to random performance. This work is motivated by recent advances in AI, such as large language models, where a model trained on a very large dataset shows emergent generalization capability on new datasets. By collecting deepfake detection datasets from the public domain and augmenting them with in-house automatic synthetic speech generation, we built one of the largest deepfake detection datasets in terms of Text-to-Speech (TTS) algorithm coverage. Our evaluation shows that models trained on our large training sets exhibit emergent generalization capability on out-of-domain data, In-the-Wild data, and unforeseen, newly published TTS systems. Our system reduced the baseline equal error rate (EER) by over an order of magnitude on test data from unforeseen TTS systems.
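
For readers unfamiliar with the metric, the equal error rate is the operating point at which the false acceptance rate (spoofed speech accepted as genuine) equals the false rejection rate (genuine speech rejected). The following is a minimal sketch of how it can be approximated from detector scores, not the paper's evaluation code. The function name `equal_error_rate`, the score convention (a higher score means more likely bona fide), and the simple threshold sweep are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumed convention (not from the paper): a higher score means the
    detector considers the utterance more likely to be bona fide.
    """
    # Candidate thresholds: every observed score, plus +inf (reject everything).
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(float("inf"))

    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False acceptance rate: spoofed utterances scored at or above t.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bona fide utterances scored below t.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy scores invented for illustration only; they are not results.
    bonafide = [0.9, 0.6, 0.4, 0.8]
    spoof = [0.1, 0.5, 0.3, 0.2]
    print(equal_error_rate(bonafide, spoof))
```

With finitely many scores the two error rates rarely cross exactly, so this sketch reports their average at the closest threshold; evaluation toolkits often interpolate along the ROC curve instead.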