Keywords: Audio Source Separation, Perceptual Quality Assessment, Uncertainty Quantification, Self‑Supervised Representation, Manifold Learning
TL;DR: We introduce two granular measures that quantify interference from competing talkers and distortion of the target signal in audio source separation, along with their error bounds.
Abstract: Objective assessment of audio source-separation systems is still poorly aligned with subjective human perception, especially when interference from competing talkers and distortion of the target signal interact. We introduce Perceptual Separation (PS) and Perceptual Match (PM), a complementary pair of measures that, by design, isolate these leakage and distortion factors.
Our intrusive approach generates a set of fundamental distortions (e.g., clipping, notch filtering, and pitch shifting) from each reference waveform in the mixture. Distortions, references, and system outputs from all sources are independently encoded by a pre-trained self-supervised model, then aggregated and embedded with diffusion maps, a manifold learning technique that aligns Euclidean distances on the manifold with dissimilarities of the encoded waveform representations.
On this manifold, PM captures the self‑distortion of a source by measuring distances from its output to its reference and associated distortions, while PS captures leakage by also accounting for distances from the output to non‑attributed references and distortions.
Both measures are differentiable and operate at a resolution as high as 75 frames per second, allowing granular optimization and analysis.
We further derive, for both measures, a frame-level deterministic error radius and non-asymptotic, high-probability confidence intervals.
Experiments on English, Spanish, and music mixtures show that, compared with 14 widely used measures, PS and PM almost always rank first or second in both linear and rank correlation with subjective human mean-opinion scores.
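As an illustration of the embedding step mentioned in the abstract, the following is a minimal sketch of a textbook diffusion map in NumPy. It is not the authors' implementation: the function name `diffusion_map`, the median bandwidth heuristic, the diffusion time `t`, and the random vectors standing in for self-supervised encoder features are all assumptions made for this example.

```python
import numpy as np

def diffusion_map(X, n_components=2, epsilon=None, t=1):
    """Embed the rows of X with a basic (textbook) diffusion map.

    Hypothetical helper for illustration only; the paper's kernel,
    bandwidth, and normalization choices may differ.
    """
    # Pairwise squared Euclidean distances between feature vectors.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)

    if epsilon is None:
        # Assumption: median heuristic for the Gaussian kernel bandwidth.
        epsilon = np.median(d2[d2 > 0])
    K = np.exp(-d2 / epsilon)

    # S = D^{-1/2} K D^{-1/2} is symmetric and shares its eigenvalues
    # with the row-stochastic Markov matrix P = D^{-1} K.
    deg = K.sum(axis=1)
    S = K / np.sqrt(np.outer(deg, deg))

    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]          # sort eigenpairs, largest first
    vals, vecs = vals[order], vecs[:, order]

    # Right eigenvectors of P are recovered as D^{-1/2} v.
    psi = vecs / np.sqrt(deg)[:, None]

    # Drop the trivial constant eigenvector (eigenvalue 1) and scale the
    # remaining coordinates by their eigenvalues raised to the power t.
    idx = slice(1, n_components + 1)
    return psi[:, idx] * (vals[idx] ** t)

# Toy stand-in for encoded waveforms: two well-separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(10, 8)),
               rng.normal(10.0, 0.5, size=(10, 8))])
Y = diffusion_map(X, n_components=2)
```

In this sketch, Euclidean distances between rows of `Y` approximate diffusion distances between the corresponding inputs, which is the property the abstract relies on when it measures distances from an output to references and distortions on the manifold.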
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 16571