Keywords: speech processing, object detection, deep neural network, sound object, detection and localization, filter bank design
Abstract: Accurately estimating the temporal location, spatial location and semantic identity label of sound sources from raw multi-channel sound waveforms is crucial for an agent to understand the 3D environment acoustically. Multiple sounds form a complex waveform mixture in time, frequency and space, so accurately detecting them requires a representation that achieves high resolution across all of these dimensions. Existing methods fail to do so because they either extract hand-engineered features (e.g., STFT, LogMel) that require a great deal of parameter tuning (e.g., filter length, window size), or learn a single filter bank that processes sound waveforms at a single scale, which often limits time-frequency resolution. In this paper, we tackle this issue by proposing to learn a group of parameterized synperiodic filter banks. Each synperiodic filter's length and frequency response are inversely related, so each filter maintains a better time-frequency resolution trade-off. By varying the periodicity term, we can easily obtain a group of synperiodic filter banks, where each bank differs in its temporal length. Convolving the proposed filter banks with the raw waveform achieves multi-scale perception in the time domain. Moreover, applying the synperiodic filter bank recursively to a downsampled waveform also achieves multi-scale perception in the frequency domain. Benefiting from multi-scale perception in both the time and frequency domains, our proposed synperiodic filter bank groups learn a data-dependent time-frequency resolution map. Following the learnable synperiodic filter bank group front-end, we add a Transformer-like backbone with two parallel soft-stitched branches to learn semantic identity label and spatial location representations semi-independently. Experiments on both the direction of arrival estimation task and the physical location estimation task show that our framework outperforms existing methods by a large margin. Replacing the front-end of existing methods with the synperiodic filter bank also improves their performance.
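The multi-scale idea in the abstract can be illustrated with a minimal sketch. This is not the paper's synperiodic filter, whose parameterization is not given here. It assumes a Hann-windowed cosine spanning a fixed number of periods as a stand-in, so that filter length is inversely related to center frequency. The names `toy_filter`, `toy_bank`, `apply_bank` and `multi_scale_features`, the use of normalized frequency (cycles per sample), and the naive factor-2 decimation are all hypothetical choices made for illustration.

```python
import numpy as np

def toy_filter(freq_norm, n_periods):
    """Hann-windowed cosine covering a fixed number of periods.

    Assumed stand-in for a synperiodic filter: the length is
    n_periods / freq_norm samples, so higher-frequency filters are shorter.
    freq_norm is in cycles per sample (0 < freq_norm < 0.5).
    """
    length = max(3, int(round(n_periods / freq_norm)))
    t = np.arange(length) - (length - 1) / 2  # centered sample index
    kernel = np.hanning(length) * np.cos(2 * np.pi * freq_norm * t)
    return kernel / np.linalg.norm(kernel)

def toy_bank(freqs_norm, n_periods):
    """One bank: same period count for every filter, different frequencies."""
    return [toy_filter(f, n_periods) for f in freqs_norm]

def apply_bank(waveform, bank):
    """Convolve a 1-D waveform with each filter; 'same' keeps input length."""
    return np.stack([np.convolve(waveform, k, mode="same") for k in bank])

def multi_scale_features(waveform, freqs_norm, periods, n_levels=2):
    """Apply a group of banks (one per period count) at several scales.

    Varying the period count changes temporal length (time-domain scales).
    Reusing the same banks on a decimated waveform shifts them to lower
    physical frequencies (frequency-domain scales).
    """
    features = {}
    x = np.asarray(waveform, dtype=float)
    for level in range(n_levels):
        for p in periods:
            features[(level, p)] = apply_bank(x, toy_bank(freqs_norm, p))
        x = x[::2]  # naive decimation by 2, no anti-aliasing filter
    return features
```

In this sketch a tone at 0.1 cycles per sample excites the 0.1 filter at the first level and, after decimation doubles its normalized frequency, the 0.2 filter at the second level. In a learnable front-end the frequencies and period counts would be trainable parameters rather than fixed lists.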
One-sentence Summary: We design a novel learnable filter bank to detect sound sources from raw multi-channel sound waveforms.
Supplementary Material: zip