Keywords: zero-shot audio similarity; content identity; single-source audio; spectral–temporal structure; audio embeddings; training-free evaluation; Global Separation Rate (GSR); Precision@k; embedding geometry; out-of-distribution generalization; cross-source aggregation; Whisper; time–frequency pooling; PCA; bioacoustics; avian perception.
TL;DR: A training-free benchmark that tests whether audio embeddings encode single-source content identity from spectral–temporal structure across diverse sources.
Abstract: The goal of general-purpose audio representations is to map acoustically variable instances of the same event to nearby points, i.e., to resolve content identity in a zero-shot setting. We introduce VocSim, a training-free benchmark that measures this capability directly on 125k single-source clips aggregated from 19 corpora spanning human speech, animal vocalizations, and environmental sounds. By restricting to single-source audio, VocSim isolates content representation from source separation confounds. We evaluate embeddings with two training-free measures: local Precision@k and a point-wise Global Separation Rate (GSR) that contrasts each item’s nearest inter-class distance with its mean intra-class distance. To calibrate GSR, we report lift over an empirical random baseline obtained by label permutation.
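The point-wise GSR described above can be sketched in a few lines: an item counts as separated when its nearest inter-class distance exceeds its mean intra-class distance, and the reported lift subtracts an empirical chance level obtained by permuting labels. This is a minimal illustration of that definition, not VocSim's released code; the Euclidean metric and the exact tie-handling are assumptions.

```python
import numpy as np

def gsr(X, y):
    """Point-wise Global Separation Rate: the fraction of items whose
    nearest inter-class distance exceeds their mean intra-class distance."""
    y = np.asarray(y)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    n, hits = len(X), 0
    for i in range(n):
        same = (y == y[i])
        same[i] = False          # exclude the item itself from its class
        diff = (y != y[i])
        if not same.any() or not diff.any():
            continue             # singleton class or single-class data: skip
        hits += D[i, diff].min() > D[i, same].mean()
    return hits / n

def gsr_lift(X, y, n_perm=100, seed=0):
    """GSR minus an empirical random baseline from label permutation."""
    rng = np.random.default_rng(seed)
    base = np.mean([gsr(X, rng.permutation(y)) for _ in range(n_perm)])
    return gsr(X, y) - base
```

On two tight, well-separated clusters, `gsr` is 1.0 and the lift is large; as class structure degrades toward chance, the lift shrinks toward zero, which is the calibration the abstract relies on.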
Across diverse models, a simple pipeline—frozen Whisper encoder features with time–frequency pooling and label-free PCA—yields strong zero-shot performance. Yet VocSim surfaces a consistent generalization gap: on blind, low-resource speech, local retrieval (P@k) drops sharply and the GSR lift over baseline is small, indicating that global class structure is only marginally better than chance. As external validation, top embeddings predict zebra finch perceptual similarity (80.9% triplet accuracy) and improve downstream bioacoustic classification. We release data, code, and a public leaderboard to standardize evaluation of zero-shot audio similarity and to catalyze representations that better generalize across sound sources and recording conditions.
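The pooling-plus-PCA stage of the pipeline above can be sketched as follows. The Whisper feature extraction itself is omitted (a frozen encoder producing a `(time, dim)` matrix per clip is assumed); the mean/std pooling scheme and the SVD-based PCA are illustrative choices, not necessarily the paper's exact configuration.

```python
import numpy as np

def pool_features(F):
    """Collapse a (time, dim) encoder-feature matrix to a fixed-length
    clip embedding via mean and std over the time axis."""
    return np.concatenate([F.mean(axis=0), F.std(axis=0)])

def pca_project(E, k):
    """Label-free PCA via SVD: center the (clips, dim) embedding matrix
    and project onto the top-k principal components."""
    Ec = E - E.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Ec, full_matrices=False)
    return Ec @ Vt[:k].T
```

Because the PCA is fit without labels, the whole pipeline stays training-free in the benchmark's sense: no supervision from the evaluation classes touches the representation.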
Primary Area: datasets and benchmarks
Submission Number: 18177