Spatial Normalization to Reduce Positional Complexity in Direction-aided Supervised Binaural Sound Source Separation

Ryu Takeda, Kazuhiro Nakadai, Kazunori Komatani

Published: 2021, Last Modified: 28 Feb 2026, APSIPA ASC 2021, CC BY-SA 4.0

Abstract: This paper describes a novel normalization method for mask-based binaural sound source separation using neural networks (NNs). Given a target source direction, the NNs estimate time-frequency masks that extract the target source components. The large number of possible combinations of source counts and positions makes the NNs difficult to train, because some spatially equivalent patterns are treated as different ones. We therefore propose spatially normalizing the input signals as a pre-processing step before mask estimation. The normalization reduces the essential positional complexity by using the target direction to convert the transfer functions of the input signals into a canonical form. It improves mask estimation and also allows the spatial pre-filters to be optimized. In experiments using mixtures of two, three, and four sources, NNs with our spatial normalization improved the signal-to-distortion ratio by 2.1 dB over NNs without it.
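The abstract does not state the normalization formula, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes a two-channel STFT `X` and a known per-frequency transfer function `a` for the target direction (both hypothetical names), and takes "canonical form" to mean that a source arriving from the target direction ends up with identical left and right channels. The mask here is a placeholder array standing in for an NN output.

```python
import numpy as np

def spatial_normalize(X, a, eps=1e-8):
    """Divide each channel by the target-direction transfer function.

    X: complex array (2, F, T), binaural STFT of the mixture (assumed layout).
    a: complex array (2, F), transfer function for the target direction
       (assumed to be known, e.g. from measured HRTFs).
    After this step a source from the target direction has the same
    spectrum in both channels, regardless of where the target is.
    """
    inv = np.conj(a) / (np.abs(a) ** 2 + eps)  # regularized 1/a, shape (2, F)
    return X * inv[:, :, None]

def apply_mask(Y, mask, ref_channel=0):
    """Apply a time-frequency mask (F, T) to one normalized channel."""
    return mask * Y[ref_channel]

# Toy check with a single target source and no interference.
rng = np.random.default_rng(0)
F, T = 4, 5
S = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
gain = rng.uniform(0.5, 1.5, size=(2, F))
phase = rng.uniform(-np.pi, np.pi, size=(2, F))
a = gain * np.exp(1j * phase)          # hypothetical target transfer function
X = a[:, :, None] * S[None, :, :]      # binaural observation of the target
Y = spatial_normalize(X, a)
mask = np.ones((F, T))                 # placeholder for an NN-estimated mask
est = apply_mask(Y, mask)
```

In this toy case both normalized channels coincide with the source spectrum, which illustrates why the network no longer has to learn a separate pattern for each target position. With interfering sources present, the mask would have to be estimated rather than set to ones.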