You Are What You Train: Rethinking Training Data Quality, Targets, and Architectures for Universal Speech Enhancement
Keywords: Universal Speech Enhancement, Speech Generation, Fidelity and Quality, Training Data Quality
Abstract: Universal Speech Enhancement (USE) aims to restore the quality of diverse degraded speech while preserving fidelity. Despite recent progress, several challenges remain. In this paper, we address three key issues. (1) In speech dereverberation, the conventional use of early-reflected speech as the training target simplifies model training, but we find that it harms perceptual quality. We therefore adopt time-shifted anechoic clean speech as a simple yet more effective target. (2) Regression models preserve fidelity but produce over-smoothed outputs under severe degradation, while generative models improve perceptual quality but risk hallucination. We provide a theoretical analysis and introduce a two-stage framework that effectively combines the strengths of both approaches. (3) We study the trade-off between training data scale and quality, a critical factor when scaling to large, imperfect corpora. Experimental results demonstrate that using time-shifted anechoic clean speech as the learning target significantly improves both speech quality and downstream automatic speech recognition (ASR) performance, while the two-stage framework further boosts quality without compromising fidelity. In addition, our model exhibits strong language-agnostic capability, making it well suited for enhancing training data in other speech generation tasks. To ensure reproducibility, the code will be made publicly available upon acceptance of the paper. Several enhanced real noisy speech examples are provided on the demo page: \url{https://anonymous.4open.science/w/USE-5232/}
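The abstract proposes replacing early-reflected speech with time-shifted anechoic clean speech as the dereverberation target. As an illustration only (the paper body presumably specifies the exact alignment procedure), the sketch below constructs such a target under the assumption that the shift equals the direct-path delay of the room impulse response; the function name and the argmax-based delay estimator are hypothetical.

```python
import numpy as np

def time_shifted_anechoic_target(clean: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Shift anechoic clean speech by the direct-path delay of the RIR.

    The reverberant input is clean speech convolved with the full RIR; shifting
    the anechoic clean signal to the direct-path arrival keeps input and target
    time-aligned without baking early reflections into the target.
    """
    # Assumed delay estimator: index of the strongest RIR tap (direct path).
    delay = int(np.argmax(np.abs(rir)))
    # Delay the clean signal by `delay` samples, keeping the original length.
    return np.concatenate([np.zeros(delay, dtype=clean.dtype), clean])[: len(clean)]

# Toy usage: pair a simulated reverberant input with its time-shifted target.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000).astype(np.float32)        # 1 s at 16 kHz (placeholder signal)
rir = np.zeros(2000, dtype=np.float32)
rir[40] = 1.0                                                 # direct path at 2.5 ms
rir[300:] = 0.05 * rng.standard_normal(1700).astype(np.float32)  # late reverberation
reverberant = np.convolve(clean, rir)[: len(clean)]
target = time_shifted_anechoic_target(clean, rir)
```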
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13267