Online Target Sound Extraction with Knowledge Distillation from Partially Non-Causal Teacher

Published: 01 Jan 2024, Last Modified: 06 Aug 2024, ICASSP 2024, CC BY-SA 4.0
Abstract: Target Sound Extraction (TSE) is a technique for extracting sound events belonging to a target sound class from a mixture using a Deep Neural Network (DNN). Offline TSE that uses non-causal models has achieved high extraction performance. However, many applications require online processing. Simply converting the non-causal TSE model architecture to a causal one leads to significant performance degradation. To mitigate this problem, we propose using Knowledge Distillation (KD) from a non-causal teacher to a causal student for TSE. In particular, we investigate different options for the non-causal teacher. We identify that a causal network with non-causal layer normalization provides a strong teacher from which it is easier to transfer knowledge to the student. We conduct experiments with simulated sound mixtures and show that training a causal TSE model with the proposed KD scheme can improve the signal-to-distortion ratio (SDR) by 0.9 dB compared to a baseline causal system.
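To make the distinction between causal and non-causal layer normalization concrete, the following is a minimal sketch in plain Python. It assumes the two variants commonly used in time-domain separation networks: a global layer normalization whose statistics cover the whole utterance (non-causal) and a cumulative layer normalization whose statistics at frame t cover only frames 0..t (causal). The abstract does not specify the exact formulation, so the function names, the `EPS` constant, and the toy feature values are illustrative assumptions, and learnable gain and bias terms are omitted.

```python
import math

EPS = 1e-8  # assumed numerical-stability constant


def global_layer_norm(frames):
    """Non-causal: mean/variance over every frame and channel of the utterance."""
    values = [v for frame in frames for v in frame]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + EPS)
    return [[(v - mean) * scale for v in frame] for frame in frames]


def cumulative_layer_norm(frames):
    """Causal: frame t is normalized with statistics of frames 0..t only."""
    out = []
    count = 0        # number of values seen so far
    total = 0.0      # running sum of values
    total_sq = 0.0   # running sum of squared values
    for frame in frames:
        count += len(frame)
        total += sum(frame)
        total_sq += sum(v * v for v in frame)
        mean = total / count
        var = max(total_sq / count - mean * mean, 0.0)
        scale = 1.0 / math.sqrt(var + EPS)
        out.append([(v - mean) * scale for v in frame])
    return out


# Toy features: 3 frames x 2 channels. `changed` differs only in the last frame.
features = [[1.0, 2.0], [3.0, 5.0], [4.0, 0.0]]
changed = [[1.0, 2.0], [3.0, 5.0], [100.0, -7.0]]

causal_a = cumulative_layer_norm(features)
causal_b = cumulative_layer_norm(changed)
global_a = global_layer_norm(features)
global_b = global_layer_norm(changed)

# Do the first two output frames depend on the (future) third input frame?
past_unchanged_causal = causal_a[:2] == causal_b[:2]
past_unchanged_global = global_a[:2] == global_b[:2]
print(past_unchanged_causal, past_unchanged_global)
```

Under these assumptions, the first flag is `True` and the second is `False`: with the cumulative variant, altering a future frame leaves earlier outputs untouched, whereas the global variant lets future frames influence every output. This mirrors the abstract's notion of a partially non-causal teacher, in which the network layers stay causal and only the normalization sees the full signal.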