Lightweight Clustering of Cepstral Features for Deepfake Audio Detection

Published: 08 Mar 2026, Last Modified: 08 Mar 2026 · ICCSIC 2026 Oral · CC BY 4.0
Track: Track 3: AI Security, Privacy, and Adversarial Defenses
Keywords: audio deepfake, unsupervised learning, handcrafted features, UMAP, K-Means, ASVspoof
TL;DR: Fully unsupervised K-means clustering of static MFCCs reveals clusters that are >97% spoofed on ASVspoof 2021 DF, offering lightweight early deepfake rejection.
Abstract: This paper presents an exploratory, fully unsupervised analysis of how short-term cepstral representations cluster bona fide (legitimate) and spoofed (illegitimate) speech samples from the ASVspoof 2021 DF corpus. The analysis employs MFCC, MFCC $\Delta$, MFCC $\Delta\Delta$, LFCC, and CQCC features, constructing utterance-level vectors by aggregating frame-wise coefficients with their means and standard deviations. K-means clustering is then applied, with the number of clusters (k) chosen in advance via the Elbow method. Cluster quality is evaluated with both intrinsic and extrinsic metrics, complemented by a per-cluster analysis of homogeneity and Shannon entropy; two-dimensional visualizations are produced via UMAP dimensionality reduction. The results show that CQCC achieves the best overall separation between bona fide and spoofed speech, whereas MFCC produces several clusters composed almost exclusively of spoofed samples. In contrast, the dynamic coefficients ($\Delta$ and $\Delta\Delta$) degrade the cluster structure. These findings demonstrate that lightweight, unsupervised cepstral front-ends can uncover meaningful spoofing-attack patterns and are promising candidates for pre-filtering or routing stages in anti-spoofing pipelines. Nevertheless, their actual impact on supervised performance metrics, such as Equal Error Rate, still warrants rigorous investigation in future work.
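The aggregation-and-clustering step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the synthetic frame matrices stand in for real MFCC/LFCC/CQCC frames (which would come from an audio front-end such as librosa), the choice of k=2 is hypothetical (the paper selects k via the Elbow method), and the silhouette score is one example of the intrinsic metrics mentioned.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical stand-in data: per-utterance frame-wise cepstral matrices of
# shape (n_frames, n_coeffs), as a front-end like librosa would produce.
# Two utterances are drawn around one mode and three around another, so a
# clean two-cluster structure exists by construction.
rng = np.random.default_rng(0)
utterances = [rng.normal(loc=c, size=(100, 13)) for c in (0.0, 0.0, 3.0, 3.0, 3.0)]

def utterance_vector(frames: np.ndarray) -> np.ndarray:
    """Aggregate frame-wise coefficients into a fixed-length utterance
    vector using per-coefficient means and standard deviations."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# Stack utterance-level vectors: 13 means + 13 stds = 26 dimensions each.
X = np.stack([utterance_vector(u) for u in utterances])

# K-means with k fixed in advance (here k=2 for illustration).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignment per utterance

# One intrinsic cluster-quality metric: the silhouette coefficient.
print(round(silhouette_score(X, km.labels_), 3))
```

The same utterance-level vectors would feed the UMAP projection and the per-cluster homogeneity/entropy analysis, which simply count bona fide versus spoofed labels within each cluster after the fact.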
Submission Number: 2