Localizing Fake Segments in Speech

Bowen Zhang, Terence Sim

Published: 2022, Last Modified: 13 Nov 2024, ICPR 2022, CC BY-SA 4.0

Abstract: Rapid progress in voice cloning technology is making phone scams easier and posing potential threats to politicians. Previous spoofing detection work has focused mainly on fully faked speech. In this work, we create the Partial Synthetic Detection (Psynd) dataset and propose a system that localizes fake segments in partially faked speech. The Psynd dataset is a multi-speaker English corpus of approximately 13 hours of read speech, sampled at 24 kHz, into which synthetic speech has been injected. The fake segments are generated by state-of-the-art multi-speaker text-to-speech models and are highly similar to the real speech into which they are injected. Our fake segment localization system consists of three parts: acoustic feature extraction, classification, and post-processing. Frame-level CQCC features are extracted and passed to a spoofing-discriminant ANN, which predicts a sequence of real or fake labels. Assuming that a fake or real segment in partially faked speech cannot be shorter than the duration of a phoneme, we flip the labels of extremely short fake or real segments. We use 1-D IoU to evaluate localization performance and obtain 98.58% in testing, far above the random-guess baseline of $\frac{1}{3}$. We also explore extreme cases such as fully faked speech, fully real speech, speech with multiple fake segments, and degraded partially faked audio. Benchmark results on this dataset show that a more robust detector is needed.
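The post-processing and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes binary frame labels (1 = fake, 0 = real, a convention chosen here), a hypothetical `min_frames` threshold standing in for the phoneme duration expressed in frames, and a frame-level approximation of 1-D IoU. The function names `flip_short_segments` and `iou_1d` are illustrative only.

```python
from itertools import groupby


def flip_short_segments(labels, min_frames):
    """Flip every run of identical labels that is shorter than min_frames.

    Assumption: labels are binary (1 = fake, 0 = real). min_frames is a
    hypothetical threshold standing in for one phoneme duration in frames.
    Runs are measured on the original sequence in a single pass.
    """
    # Run-length encode the label sequence: [(label, run_length), ...]
    runs = [(label, len(list(group))) for label, group in groupby(labels)]
    # A sequence made of a single run has no neighbours to merge into.
    if len(runs) <= 1:
        return list(labels)
    smoothed = []
    for label, length in runs:
        if length < min_frames:
            label = 1 - label  # flip the short run to the opposite label
        smoothed.extend([label] * length)
    return smoothed


def iou_1d(predicted, reference):
    """Frame-level 1-D IoU of the fake regions in two label sequences.

    Assumption: when neither sequence contains a fake frame, the union is
    empty and the score is defined as 1.0 (perfect agreement).
    """
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same length")
    intersection = sum(1 for p, r in zip(predicted, reference) if p == 1 and r == 1)
    union = sum(1 for p, r in zip(predicted, reference) if p == 1 or r == 1)
    return 1.0 if union == 0 else intersection / union


# Toy example: noisy frame predictions around one true fake segment.
reference = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
raw = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
smoothed = flip_short_segments(raw, min_frames=2)
print(iou_1d(raw, reference), iou_1d(smoothed, reference))
```

In this toy case the two isolated one-frame errors are flipped, so the smoothed prediction matches the reference exactly. A single pass over the original runs is the simplest choice; an iterative variant that re-merges runs after each flip would behave differently on sequences with many adjacent short runs.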