Keywords: Speech Enhancement, Diffusion Models, Exponential Averaging, Speech Regeneration, Phoneme Awareness
Abstract: Speech enhancement (SE) improves the robustness of downstream speech technologies under noisy conditions. Self-supervised models such as Wav2Vec 2.0 produce robust frame-level representations that capture phonetic information, but directly conditioning on noisy embeddings can propagate errors. In this work, we propose a temporal abstraction strategy that applies exponential smoothing to Wav2Vec 2.0 embeddings before conditioning a diffusion-based SE network via FiLM modulation. This approach reduces the conditioning signal's sensitivity to frame-level noise. We evaluate the method on a noisy dataset and demonstrate improvements in PESQ and STOI across multiple SNRs and model configurations. Ablations show that exponential smoothing outperforms both naive averaging and unconditioned diffusion models by 0.35 PESQ.
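The two operations named in the abstract can be sketched briefly. This is a minimal NumPy illustration, not the paper's implementation: the smoothing factor `alpha` and the FiLM projection matrices `w_gamma`/`w_beta` are hypothetical placeholders, and the real system would produce them with learned networks over Wav2Vec 2.0 features.

```python
import numpy as np

def exponential_smoothing(embeddings, alpha=0.9):
    """Causal EMA over frames: s_t = alpha * s_{t-1} + (1 - alpha) * x_t.

    embeddings: array of shape (T, D) -- one Wav2Vec-style vector per frame.
    """
    smoothed = np.empty_like(embeddings)
    smoothed[0] = embeddings[0]
    for t in range(1, len(embeddings)):
        smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * embeddings[t]
    return smoothed

def film(features, cond, w_gamma, w_beta):
    """FiLM: feature-wise affine modulation gamma(c) * h + beta(c).

    cond is the (smoothed) conditioning vector; w_gamma/w_beta stand in
    for the learned projections that predict scale and shift.
    """
    gamma = cond @ w_gamma  # per-channel scale
    beta = cond @ w_beta    # per-channel shift
    return gamma * features + beta

# Toy example: T=4 frames of D=3-dimensional "embeddings".
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
s = exponential_smoothing(x, alpha=0.9)

# Modulate a feature map of the same width with the last smoothed frame.
w_g = rng.standard_normal((3, 3))
w_b = rng.standard_normal((3, 3))
h = film(rng.standard_normal((4, 3)), s[-1], w_g, w_b)
```

With a high `alpha`, each smoothed frame is dominated by its history, so a single noise-corrupted frame perturbs the conditioning signal far less than it would under direct per-frame conditioning.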
Paper Type: Short
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: automatic speech recognition; speech technologies; spoken dialog; spoken language grounding; speech and vision; spoken language translation; spoken language understanding; QA via spoken queries
Languages Studied: English
Submission Number: 3843