Keywords: audio deepfake, spoofed audio detection, Artificial Intelligence, linguistics, sociolinguistics, linguistic perception
TL;DR: Linguistic module to add explainability and speed to audio deepfake detectors at scale
Abstract: Artificial Intelligence (AI)-generated content (deepfake content) is considered a major threat that can enable fraud and the spread of misinformation. Generative AI research has focused largely on advances in content generation, with limited attention to the detection of AI-generated content, particularly AI-generated audio. Although foundation models offer powerful representations for detecting spoofed audio, their limited explainability and slow feature-extraction speeds hinder scalability. Prior research integrated sociolinguistic expertise to identify phonetic and phonological cues in spoken English for spoofed-audio detection. While successful, this approach was limited in scale because it relied on manual labeling of linguistic features by domain experts. In this paper, we propose a novel model that auto-labels expert-informed phonetic and phonological cues using deep-learning-based representations fine-tuned with domain-expert input. Using this fine-tuning method with expert-informed features, we scale this interdisciplinary approach and demonstrate its benefits in enhancing explainability and reducing the time cost of applying foundation models in large-scale settings. For example, against XLSR-Wav2vec-ResNet18, one of the most recent baselines, our method decreases the baseline's Equal Error Rate in audio deepfake detection by at least 7\% (effectiveness) on a subset of the ASVspoof5 dataset. Our proposed cost-efficient ensemble setup reduces audio deepfake detection time by 31\% (scalability). Additionally, the algorithmically encoded linguistic features enhance explainability via reverse engineering (explainability).
Our proposed method is a multi-view approach, leveraging not only deep representations but also human expert-informed phonetic and phonological aspects of natural speech.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13506