Keywords: Speech Enhancement, Diffusion Models, Exponential Averaging, Speech Regeneration, Phoneme Awareness
Abstract: Speech enhancement (SE) improves the robustness of downstream speech technologies under noisy conditions. Self-supervised models such as Wav2Vec 2.0 produce robust frame-level representations that capture phonetic information, but directly conditioning on noisy embeddings can propagate errors. In this work, we propose a temporal abstraction strategy that applies exponential smoothing to Wav2Vec 2.0 embeddings before conditioning a diffusion-based SE network via FiLM modulation. This approach reduces the conditioning signal's sensitivity to frame-level noise. We evaluate the method on a noisy dataset and demonstrate improvements in PESQ and STOI across multiple SNRs and model configurations. Ablations show that exponential smoothing outperforms both naive averaging and unconditioned diffusion models by 0.35 PESQ.
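The two operations named in the abstract can be sketched briefly. This is a minimal NumPy illustration, not the paper's implementation: the smoothing factor `alpha` and the FiLM projection matrices `w_gamma`/`w_beta` are hypothetical placeholders, and the real system would produce them with learned networks over Wav2Vec 2.0 features.

```python
import numpy as np

def exponential_smoothing(embeddings, alpha=0.9):
    """Causal EMA over frames: s_t = alpha * s_{t-1} + (1 - alpha) * x_t.

    embeddings: array of shape (T, D) -- one Wav2Vec-style vector per frame.
    """
    smoothed = np.empty_like(embeddings)
    smoothed[0] = embeddings[0]
    for t in range(1, len(embeddings)):
        smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * embeddings[t]
    return smoothed

def film(features, cond, w_gamma, w_beta):
    """FiLM: feature-wise affine modulation gamma(c) * h + beta(c).

    cond is the (smoothed) conditioning vector; w_gamma/w_beta stand in
    for the learned projections that predict scale and shift.
    """
    gamma = cond @ w_gamma  # per-channel scale
    beta = cond @ w_beta    # per-channel shift
    return gamma * features + beta

# Toy example: T=4 frames of D=3-dimensional "embeddings".
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
s = exponential_smoothing(x, alpha=0.9)

# Modulate a feature map of the same width with the last smoothed frame.
w_g = rng.standard_normal((3, 3))
w_b = rng.standard_normal((3, 3))
h = film(rng.standard_normal((4, 3)), s[-1], w_g, w_b)
```

With a high `alpha`, each smoothed frame is dominated by its history, so a single noise-corrupted frame perturbs the conditioning signal far less than it would under direct per-frame conditioning.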
Paper Type: Short
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: automatic speech recognition; speech technologies; spoken dialog; spoken language grounding; speech and vision; spoken language translation; spoken language understanding; QA via spoken queries
Languages Studied: English
Submission Number: 3843