Keywords: audio deepfake, text-to-speech, detection, attribution, benchmark
Abstract: Modern text-to-speech (TTS) models increasingly rely on foundation pretraining followed by post-training adaptation, creating new challenges for audio deepfake detection and attribution in the wild.
Prior benchmarks mainly evaluate against fixed generators and thus underestimate the impact of adaptation-induced distribution shifts.
We present GenTrace, a benchmark that tracks TTS evolution from foundation pretraining to diverse adaptation strategies, with controlled prompts and speakers to isolate model-induced differences (16 variants, 49,728 synthesized utterances).
Using GenTrace, we find that alignment-based adaptation typically preserves detection accuracy, while architecture and pretraining data have a substantially larger effect on attribution performance.
GenTrace supports reproducible evaluation of detection and attribution robustness under realistic model adaptation scenarios. GenTrace will be publicly released upon acceptance.
Paper Type: Short
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Contribution Types: Data resources, Data analysis
Languages Studied: English, Chinese
Submission Number: 5319