WaveSpect: A Hybrid Approach to Synthetic Audio Detection via Waveform and Spectrogram Analysis

Published: 2025 · Last Modified: 05 Jan 2026 · ICASSP 2025 · CC BY-SA 4.0
Abstract: With the rapid advancement of synthetic speech technology, the challenges posed by audio deepfakes have become increasingly severe. Despite notable progress in synthetic speech detection, existing algorithms exhibit limited generalization to unknown attacks. To address these challenges, we propose WaveSpect, which combines waveform and spectrogram features to capture subtle artifacts typically overlooked by single-feature methods. By integrating these complementary features, WaveSpect provides more discriminative information. Additionally, we constructed the Chinese Fake Speech Dataset (CFSD), which consists of synthetic speech generated by eight state-of-the-art speech synthesis technologies, to evaluate the generalization ability of our model. Experimental results demonstrate that WaveSpect achieves an equal error rate (EER) of 0.15% and a minimum tandem detection cost function (min t-DCF) of 0.0048 on the ASVspoof2019 LA dataset, and an EER of 0.14% on the CFSD dataset, outperforming all existing single models. These results highlight the superior performance of the WaveSpect architecture in synthetic speech detection.
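The core fusion idea — extracting features from both the raw waveform and its spectrogram, then concatenating them into a single, more discriminative representation — can be sketched as follows. Note this is a minimal illustration using hand-crafted per-frame statistics (energy, zero-crossing rate) as stand-ins for WaveSpect's learned waveform branch; the function names, frame sizes, and chosen statistics are assumptions, not the paper's actual architecture.

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=256, hop=128):
    """Magnitude spectrogram via a short-time Fourier transform."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))  # (n_frames, n_fft//2 + 1)

def fused_features(x, n_fft=256, hop=128):
    """Concatenate per-frame waveform statistics with spectrogram bins.

    Illustrative stand-in for WaveSpect's dual-branch fusion: a real system
    would use learned encoders for each branch before combining them.
    """
    spec = magnitude_spectrogram(x, n_fft, hop)
    starts = range(0, len(x) - n_fft + 1, hop)
    # Waveform branch: per-frame energy and zero-crossing rate.
    energy = np.array([np.mean(x[i:i + n_fft] ** 2) for i in starts])
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(x[i:i + n_fft])))) / 2
                    for i in starts])
    wave_feats = np.stack([energy, zcr], axis=-1)       # (n_frames, 2)
    # Fusion: simple concatenation along the feature axis.
    return np.concatenate([wave_feats, spec], axis=-1)  # (n_frames, 2 + bins)

# Usage: 1 s of a 440 Hz tone at 16 kHz.
t = np.arange(16000) / 16000.0
feats = fused_features(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (124, 131): 124 frames, 2 waveform stats + 129 FFT bins
```

The fused matrix would then feed a classifier; the complementary branches let spectral artifacts and time-domain artifacts both contribute to the decision.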