Abstract: Pitch scaling is an efficient spoofing attack: by linearly raising or lowering the pitch of a voice, it conceals a speaker's identity at low cost, which threatens the security of speaker identification systems. Several studies have explored recognizing pitch-shifted voices and restoring them to their original versions. However, these studies focus primarily on the recognition process and neglect restoration, resulting in poor acoustic quality in the restored voice. In this paper, we propose a unified framework (UniVR) that combines a transformer-based estimator for recognizing pitch-shifted voices with a specially designed voice restoration network for restoring them in high quality. Experiments on the AISHELL-1 and AISHELL-3 datasets demonstrate that UniVR achieves state-of-the-art results compared with existing anti-pitch-scaling methods in both recognition accuracy and restoration quality. Moreover, UniVR can protect existing speaker identification systems from pitch-scaling attacks.