Abstract: With the rapid development of audio deepfake technology, the credibility and authenticity of public opinion face a formidable challenge. Because the vocoder is the key component of audio deepfake generation and leaves distinctive fingerprint features, we propose VFD-Net (Vocoder Fingerprints Detection Net), a new scheme for attributing vocoder architectures. It is based on patch-wise supervised contrastive learning (PCL), which captures the global consistency of vocoder fingerprints and improves detection performance in cross-set testing and audio compression scenarios. PCL brings patches belonging to the same vocoder class closer together in the representation space while pushing patches from different vocoder classes further apart. Comparative experimental results show that the average accuracy of our proposed approach exceeds that of state-of-the-art methods by 30%-45% under cross-set testing and AAC compression. Furthermore, our approach achieves an average accuracy of 83.67% in short-term fake audio detection within one second. It can also be used to detect partially fake audio by analyzing the consistency of vocoder fingerprints.
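To make the pull-together, push-apart behavior of PCL concrete, the following is a minimal sketch of a generic supervised contrastive loss applied to patch embeddings. It is not the VFD-Net implementation: the loss form is the standard supervised contrastive objective from the literature, and the function names, the temperature value, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over patch embeddings.

    embeddings: list of equal-length float lists, one per patch (hypothetical).
    labels:     vocoder class index of each patch.
    temperature: assumed value, not taken from the paper.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity to every other patch.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Positives are the other patches with the same vocoder label.
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Numerically stable log-sum-exp over all non-anchor patches.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy patches: two near the x-axis, two near the y-axis.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_consistent = supervised_contrastive_loss(patches, [0, 0, 1, 1])
loss_scrambled = supervised_contrastive_loss(patches, [0, 1, 0, 1])
```

When the labels agree with the geometry of the embeddings (`loss_consistent`), the loss is small; when same-class patches lie far apart (`loss_scrambled`), it is large. Minimizing such a loss is what drives same-vocoder patches together and different-vocoder patches apart in the representation space.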