Exploiting Spatial Separability for Deep Learning Multichannel Speech Enhancement with an Align-and-Filter Network

Published: 01 Feb 2023, Last Modified: 13 Feb 2023
Submitted to ICLR 2023
Readers: Everyone
Keywords: Multichannel speech enhancement, microphone array beamforming, spatial filtering, signal alignment, relative transfer functions
TL;DR: This paper presents an Align-and-Filter network that exploits the spatial separability of sound sources for deep learning multichannel speech enhancement by incorporating relative transfer functions for signal alignment within a two-stage sequential masking design.
Abstract: Multichannel speech enhancement (SE) systems separate the target speech from background noise by performing spatial and spectral filtering. The development of multichannel SE has a long history in the signal processing field, where one crucial step is to exploit the spatial separability of sound sources by aligning the microphone signals in response to the target speech source prior to further filtering. However, most existing deep learning based multichannel SE works have yet to effectively incorporate or emphasize this spatial alignment aspect in the network design – we postulate that this is owing to the lack of suitable datasets with sufficient spatial diversity of the speech sources. In this paper, we highlight this important but often overlooked step in deep learning based multichannel SE, i.e., signal alignment, by introducing an Align-and-Filter network (AFnet) featuring a two-stage sequential masking design. The AFnet estimates two sets of masks, the alignment masks and the filtering masks, and multiplies the estimated masks with the respective input signals of each stage sequentially, while leveraging the relative transfer functions (RTFs) to guide the model in aligning signals from various speech source locations during training. We argue that the popular CHiME-3 multichannel dataset is limited in its representation of spatially diverse speech data, as its speakers were mostly located toward the front, and therefore adopt simulated and real-world measured room impulse responses to generate multichannel recordings in which the target speech source may come from an arbitrary direction. Our findings suggest that for spatially diverse speaker scenarios, careful exploitation of spatial characteristics is of great importance for deep learning based multichannel SE, especially as the number of microphones increases. We show that utilizing the RTFs for signal alignment in the two-stage, sequential masking framework consistently improves the network's ability to separate the target speech from the noise signals, supporting that spatial separability is effectively exploited by the proposed model. Our studies advocate the advantages and significance of the signal alignment step, a wisdom inherited from conventional signal processing, for developing future deep learning based multichannel SE algorithms that improve enhancement outcomes in positionally diverse target speech scenarios.
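The two-stage masking description in the abstract can be made concrete with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the module name AFnetSketch, the MLP mask estimators, the channel-summing combiner, and the rtf_align_target helper are all hypothetical names introduced here for clarity. It shows stage 1 estimating per-channel complex alignment masks (which training would push toward RTF-based alignment) and stage 2 estimating a single filtering mask applied to the aligned, combined signal.

```python
# Minimal sketch of the two-stage align-and-filter idea, assuming complex
# STFT inputs. Module/helper names and layer sizes are illustrative only.
import torch
import torch.nn as nn


def rtf_align_target(X: torch.Tensor, rtf: torch.Tensor) -> torch.Tensor:
    """RTF-based alignment target for stage 1 (hypothetical helper).

    X:   complex STFT, shape (batch, mics, freq, frames)
    rtf: complex RTFs w.r.t. a reference mic, shape (batch, mics, freq)

    Dividing each channel by its RTF steers the speech component of every
    microphone to the reference channel's phase and level, i.e., it aligns
    the array response to the target source.
    """
    return X / (rtf.unsqueeze(-1) + 1e-8)


class AFnetSketch(nn.Module):
    """Stage 1: per-channel alignment masks. Stage 2: filtering mask."""

    def __init__(self, n_mics: int, n_freq: int, hidden: int = 256):
        super().__init__()
        in_feat = 2 * n_mics * n_freq  # stacked real/imag of all channels
        self.align_net = nn.Sequential(
            nn.Linear(in_feat, hidden), nn.ReLU(),
            nn.Linear(hidden, in_feat),     # -> M complex masks per bin
        )
        self.filter_net = nn.Sequential(
            nn.Linear(in_feat, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * n_freq),  # -> 1 complex mask per bin
        )

    @staticmethod
    def _frame_feats(Z: torch.Tensor) -> torch.Tensor:
        # (B, M, F, T) complex -> (B, T, 2*M*F) real features per frame
        B, M, F, T = Z.shape
        return torch.view_as_real(Z).permute(0, 3, 1, 2, 4).reshape(B, T, -1)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: complex STFT of the array, shape (batch, mics, freq, frames)
        B, M, F, T = X.shape

        # Stage 1: estimate complex alignment masks and apply them per
        # channel; training would supervise the masked output against an
        # RTF-aligned target such as rtf_align_target(X, rtf).
        a = self.align_net(self._frame_feats(X))
        A = torch.view_as_complex(
            a.reshape(B, T, M, F, 2).contiguous()).permute(0, 2, 3, 1)
        X_aligned = A * X                      # (B, M, F, T)

        # Stage 2: estimate one filtering mask and apply it to the aligned
        # channels combined by a simple sum (a delay-and-sum stand-in).
        f = self.filter_net(self._frame_feats(X_aligned))
        Fmask = torch.view_as_complex(
            f.reshape(B, T, F, 2).contiguous()).permute(0, 2, 1)
        return Fmask * X_aligned.sum(dim=1)    # enhanced STFT (B, F, T)
```

As a toy invocation, AFnetSketch(n_mics=6, n_freq=257)(torch.randn(1, 6, 257, 100, dtype=torch.cfloat)) returns a single-channel enhanced STFT. The key design choice this sketch reflects is the sequential ordering: the filtering stage only sees signals that the alignment stage has already steered toward the target source, which is what lets the model exploit spatial separability.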
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)