D4AM: A General Denoising Framework for Downstream Acoustic Models


Abstract: The performance of acoustic models degrades notably in noisy environments. Speech enhancement (SE) can be used as a unified noise robustness strategy to serve automatic speech recognition (ASR) systems with various setups. However, the training objectives of existing SE approaches do not consider the generalization ability toward unseen ASR systems. In this study, we propose a general denoising framework for various downstream acoustic models (D4AM). Our framework fine-tunes the SE model with the backward gradient according to a specific acoustic model and the corresponding classification objective. At the same time, our method takes the regression objective as an auxiliary loss to make the SE model generalize to other unseen acoustic models. To jointly train an SE unit with the regression and classification objectives, D4AM uses an adjustment scheme to directly estimate suitable weighting coefficients instead of going through a grid search process with additional training costs. The adjustment scheme consists of two parts, namely gradient calibration and regression objective weighting. Experimental results show that D4AM can consistently and effectively provide improvements to various unseen acoustic models and outperforms other combination setups. To the best of our knowledge, this is the first work that deploys an effective combination scheme of regression (denoising) and classification (ASR) objectives to derive a general denoising pre-processor applicable to various unseen ASR systems.
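The two parts of the adjustment scheme can be illustrated with a small sketch. The abstract does not spell out the calibration rule or how the regression weight is estimated, so the following makes two assumptions: calibration is done by projecting out the component of the classification gradient that opposes the regression gradient, and the regression weight `alpha` is simply passed in as a fixed number. The function names `calibrate` and `combined_update` are hypothetical and are not taken from the D4AM code.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two flattened gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def calibrate(g_cls: Vector, g_reg: Vector) -> Vector:
    """Assumed calibration rule: if the classification gradient points
    against the regression gradient, remove the opposing component."""
    inner = dot(g_cls, g_reg)
    norm_sq = dot(g_reg, g_reg)
    if inner >= 0.0 or norm_sq == 0.0:
        # No conflict (or no regression signal): keep the gradient as is.
        return list(g_cls)
    scale = inner / norm_sq
    return [c - scale * r for c, r in zip(g_cls, g_reg)]


def combined_update(g_cls: Vector, g_reg: Vector, alpha: float) -> Vector:
    """Calibrated classification gradient plus the weighted regression gradient.
    `alpha` is a hypothetical fixed weight; D4AM estimates its weighting
    coefficients rather than fixing them by hand."""
    g_cal = calibrate(g_cls, g_reg)
    return [c + alpha * r for c, r in zip(g_cal, g_reg)]


if __name__ == "__main__":
    g_cls = [1.0, -1.0]  # toy classification (ASR) gradient
    g_reg = [0.0, 1.0]   # toy regression (denoising) gradient
    print(calibrate(g_cls, g_reg))
    print(combined_update(g_cls, g_reg, alpha=0.5))
```

In this toy case the two gradients conflict, so the calibrated classification gradient ends up orthogonal to the regression gradient before the weighted regression term is added. In a real training loop the vectors would be the flattened SE-model gradients of the two losses.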

Demonstration of enhanced results on the CHiME-4 real sets

In Section 4.3, we evaluated the enhanced speech signals using an objective perceptual metric, DNSMOS. The finding that higher recognition performance is accompanied by higher perceptual quality supports our conjecture: any critical point of the classification objective should be "covered" by the critical point space of the regression objective. Here, we list some samples for each noise type for demonstration.
DT05_BUS_REAL
DT05_CAF_REAL
DT05_PED_REAL
DT05_STR_REAL
ET05_BUS_REAL
ET05_CAF_REAL
ET05_PED_REAL
ET05_STR_REAL