Keywords: speech separation, speech enhancement, deep learning, early exit, dynamic neural networks
Abstract: In recent years, deep learning-based single-channel speech separation has improved
considerably, in large part driven by increasingly compute- and parameter-efficient
neural network architectures. Most such architectures are, however, designed with a
fixed compute and parameter budget and consequently cannot scale to varying compute
demands or resources, which limits their use in embedded and heterogeneous
devices such as mobile phones and hearables.
To enable such use cases, we design a neural network architecture for speech separation
and enhancement capable of early exit, and we propose an uncertainty-aware
probabilistic framework that jointly models the clean speech signal and the error
variance, from which we derive probabilistic early-exit conditions in terms of desired
signal-to-noise ratios.
We evaluate our methods on both speech separation and enhancement tasks where we
demonstrate that early-exit capabilities can be introduced without compromising
reconstruction, and that, when trained on variable-length audio, our early-exit
conditions are well calibrated, remain directly interpretable, and yield considerable
compute savings when used to scale computation dynamically at test time.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24252