Abstract: Recently, sound event detection (SED) has made significant advances through the application of deep learning, but many difficulties and challenges remain. One major challenge is the diversity of sound events, which leads to substantial variation in time-frequency features. In addition, most existing SED models cannot effectively handle sound events of different scales, particularly those of short duration. Another challenge is the lack of well-labeled datasets. The commonly used remedy is the mean teacher method, but inaccurate pseudo-labels can lead to confirmation bias and performance imbalance. In this paper, we introduce multi-dimensional frequency dynamic convolution, which endows convolutional kernels with frequency-adaptive dynamic properties to enhance feature representation capability. Moreover, we propose a dual self-attention pooling function to achieve more precise temporal localization. Finally, to address the problem of incorrect pseudo-labels, we propose a confidence-aware mean teacher that increases pseudo-label accuracy and trains the student model with high-confidence labels. Experimental results on the DCASE2017, DCASE2018, and DCASE2023 Task 4 datasets validate the superior performance of the proposed methods.
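To make the confidence-aware idea concrete, the following is a minimal sketch of how high-confidence pseudo-label selection in a mean-teacher setup could look. It assumes a sigmoid multi-label SED output and a hypothetical threshold `conf_threshold`; the function name and the exact confidence criterion are illustrative assumptions, not the paper's definitive implementation.

```python
import torch
import torch.nn.functional as F

def confidence_masked_pseudo_label_loss(student_logits, teacher_logits, conf_threshold=0.9):
    """Sketch: train the student only on the teacher's high-confidence frames.

    student_logits, teacher_logits: tensors of shape (batch, frames, classes).
    conf_threshold is a hypothetical hyperparameter; the paper's criterion may differ.
    """
    with torch.no_grad():
        teacher_probs = torch.sigmoid(teacher_logits)           # multi-label SED -> sigmoid
        pseudo_labels = (teacher_probs > 0.5).float()           # hard pseudo-labels from the teacher
        # Confidence as distance from the decision boundary, mapped to [0.5, 1.0]
        confidence = torch.maximum(teacher_probs, 1.0 - teacher_probs)
        mask = (confidence >= conf_threshold).float()           # keep only confident entries

    per_element = F.binary_cross_entropy_with_logits(
        student_logits, pseudo_labels, reduction="none")
    # Average the loss over the selected high-confidence entries only
    return (per_element * mask).sum() / mask.sum().clamp(min=1.0)
```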