Abstract: Recent advancements in singing voice synthesis have significantly improved the quality of artificial singing voices, raising concerns about their potential misuse in generating deepfake singing voices, or “singfakes.” Detecting these synthetic voices presents unique challenges due to the complex nature of singing, which involves pitch, timbre, and accompaniment variations. In this study, we conduct a comparative analysis of two model types for singfake detection: (1) models utilizing Log-Mel spectrograms, such as the Audio Spectrogram Transformer (AST) and Whisper, and (2) models that process raw waveform inputs, including UniSpeech-SAT and HuBERT. Our experiments on the SingFake dataset evaluate these models under two input conditions—separated vocal tracks and full song mixtures—across different test subsets. The results indicate that spectrogram-based models generally outperform waveform-based models, notably on unseen singers. Metrics such as Precision, Recall, F1-score, Equal Error Rate (EER), and Area Under the Curve (AUC) provide insights into the strengths and weaknesses of each approach. Our findings contribute to the development of more effective deepfake singing detection methods, with implications for security, media authentication, and digital content protection.
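For readers less familiar with the evaluation metrics named above, the snippet below is a minimal sketch, not taken from the paper, of how EER and AUC could be computed from per-song detector scores; the scikit-learn calls and the synthetic labels/scores are illustrative assumptions.

```python
# Minimal sketch (assumption, not the paper's evaluation code): computing EER and AUC
# from singfake-detector scores using NumPy and scikit-learn.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """Equal Error Rate: the operating point where false-positive and false-negative rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)   # labels: 1 = fake, 0 = bonafide (assumed convention)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))     # threshold index where FPR is closest to FNR
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Hypothetical scores from a detector (higher = more likely fake).
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.4, 0.3, 0.1, 0.35, 0.7, 0.2])
print(f"EER = {compute_eer(labels, scores):.3f}, AUC = {roc_auc_score(labels, scores):.3f}")
```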