Explaining Deep Learning Models for Speech Enhancement

Sunit Sivasankaran; Emmanuel Vincent; Dominique Fohr

Published: 01 Jan 2021, Last Modified: 30 Sept 2024. Venue: Interspeech 2021. License: CC BY-SA 4.0

Abstract: We consider the problem of explaining the robustness to mismatched noise conditions of neural networks that compute time-frequency masks for speech enhancement. We employ the Deep SHapley Additive exPlanations (DeepSHAP) feature attribution method to quantify the contribution of every time-frequency bin in the input noisy speech signal to every time-frequency bin in the output time-frequency mask. We define an objective metric, referred to as the speech relevance score, that summarizes the obtained SHAP values, and we show that it correlates with enhancement performance, as measured by the word error rate on the CHiME-4 real evaluation dataset. We use the speech relevance score to explain the generalization ability of three speech enhancement models trained using synthetically generated speech-shaped noise, noise from a professional sound effects library, or real CHiME-4 noise. To the best of our knowledge, this is the first study on neural network explainability in the context of speech enhancement.
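
To make the idea of attributing every output mask bin to every input bin concrete, the following is a minimal sketch, not the authors' implementation. It computes exact Shapley values by brute force for a tiny hypothetical mask function with four input "bins" and two output "bins". DeepSHAP approximates such values efficiently for deep networks; here `toy_mask`, its weights, the all-zero baseline, and the input values are all assumptions chosen for illustration, and the speech relevance score itself is not reproduced because the abstract does not define it.

```python
import itertools
import math

import numpy as np


def toy_mask(x):
    """Hypothetical stand-in for a mask estimator: 4 input bins -> 2 mask values in (0, 1)."""
    w = np.array([[0.8, -0.5, 0.3, 0.0],
                  [0.1, 0.4, -0.2, 0.6]])  # arbitrary illustrative weights
    return 1.0 / (1.0 + np.exp(-(w @ x)))


def exact_shapley(f, x, baseline):
    """Return phi with phi[j, i] = Shapley contribution of input bin i to output bin j.

    Inputs outside a coalition are replaced by their baseline value.
    Cost is exponential in len(x), so this is only usable for toy sizes.
    """
    n = len(x)
    phi = np.zeros((len(f(baseline)), n))
    for i in range(n):
        others = [k for k in range(n) if k != i]
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in itertools.combinations(others, size):
                idx = np.array(subset, dtype=int)
                z = baseline.copy()
                z[idx] = x[idx]
                without_i = f(z)
                z[i] = x[i]
                with_i = f(z)
                phi[:, i] += weight * (with_i - without_i)
    return phi


x = np.array([1.0, 0.5, -0.3, 2.0])   # hypothetical noisy input bins
baseline = np.zeros(4)                # hypothetical reference input
phi = exact_shapley(toy_mask, x, baseline)

# Additivity: per output bin, attributions sum to f(x) - f(baseline).
print(np.allclose(phi.sum(axis=1), toy_mask(x) - toy_mask(baseline)))
```

Each row of `phi` plays the role of one output mask bin and each column one input bin, which mirrors the input-to-output attribution structure described in the abstract at a much smaller scale.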
