Abstract: Diffusion-based generative models (DGMs) have recently attracted attention in speech enhancement (SE) research, as previous work has shown that they generalize remarkably well. However, DGMs are also computationally intensive: they usually require many iterations in the reverse diffusion process (RDP), which makes them impractical for streaming SE systems. In this paper, we propose using scores estimated from discriminative models in the first steps of the RDP. These discriminative-based scores require only one forward pass of the discriminative model for multiple RDP steps, which greatly reduces computation. The approach also improves performance: we show that choosing an appropriate number of discriminative guidance steps can yield an overall model that performs better than either the generative or the discriminative model alone. Furthermore, we propose a novel streamable time-domain generative model with an algorithmic latency of 50 ms that shows no significant performance degradation compared to offline models.