Abstract: Neural networks for speech separation generally exhibit high computational costs and large memory footprints. Moreover, typical separation networks have a fixed computational graph that processes all input frames at a uniform computational cost, even though intensive processing may not be necessary for frames containing silence or a single active speaker. Addressing this computational inefficiency is especially crucial when these networks are deployed on resource-constrained devices. In this letter, we propose a dynamic slimmable network for speech separation that mitigates the computational inefficiency of existing networks. We introduce slimmable layers with a gating mechanism that can adapt their computational complexity based on the input characteristics. As an example, we propose to use the slimmable layers in the intra-chunk blocks of a dual-path structure-based network to facilitate adaptation based on the local characteristics of the input signal. Experimental evaluation on simulated two-speaker mixtures from the WSJ0-2mix dataset demonstrates that the proposed method substantially reduces the computational cost while maintaining comparable performance to fully utilized static networks.
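To make the idea of a slimmable layer with a gate more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear layer whose active output width is chosen per frame. The class `SlimmableLinear`, the function `energy_gate`, the energy threshold, and the two width ratios are all hypothetical names and values introduced for illustration. The abstract does not describe how its gating mechanism decides, so a hand-written frame-energy rule stands in for it here.

```python
import random


class SlimmableLinear:
    """Linear layer that can run with only a prefix of its output units (hypothetical)."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.in_dim = in_dim
        self.out_dim = out_dim
        # Full-width parameters; narrower configurations reuse the first rows.
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def forward(self, x, width_ratio):
        # Number of output units actually computed for this frame.
        active = max(1, int(round(self.out_dim * width_ratio)))
        out = [sum(w * xi for w, xi in zip(row, x)) + b
               for row, b in zip(self.weight[:active], self.bias[:active])]
        # Zero-pad so downstream code always sees a fixed-size vector.
        return out + [0.0] * (self.out_dim - active), active

    def macs(self, active):
        # Multiply-accumulate operations spent on the active units only.
        return active * self.in_dim


def energy_gate(frame, threshold=1e-3, widths=(0.25, 1.0)):
    """Assumed stand-in gate: narrow width for low-energy frames, full width otherwise."""
    energy = sum(v * v for v in frame) / len(frame)
    return widths[0] if energy < threshold else widths[1]


layer = SlimmableLinear(in_dim=4, out_dim=8)
silent_frame = [0.0, 0.0, 0.0, 0.0]
active_frame = [0.5, -0.5, 0.5, -0.5]

for name, frame in (("silent", silent_frame), ("active", active_frame)):
    ratio = energy_gate(frame)
    _, active_units = layer.forward(frame, ratio)
    print(name, "width ratio:", ratio, "MACs:", layer.macs(active_units))
```

In this sketch the silent frame uses a quarter of the output units and therefore a quarter of the multiply-accumulates, which illustrates the kind of input-dependent saving the abstract describes.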
External IDs: dblp:journals/spl/ElminshawiCH24