ASiT-CRNN: A method for sound event detection with fine-tuning of self-supervised pre-trained ASiT-based model

Published: 01 Jan 2025, Last Modified: 16 May 2025 · Digital Signal Processing, 2025 · CC BY-SA 4.0
Abstract: Transferring knowledge from pre-trained models to downstream tasks has become an increasingly common practice. In this paper, we present an effective sound event detection (SED) method, referred to as ASiT-CRNN, which improves on the convolutional recurrent neural network (CRNN) baseline system of DCASE 2022 Task 4 by embedding a local-global audio spectrogram vision transformer (ASiT) with a two-phase fine-tuning strategy. ASiT is an audio classification model pre-trained on the large-scale AudioSet dataset using several self-supervised learning (SSL) methods. However, because of the mismatch between clip-level and frame-level tasks, feeding the ASiT output into the SED model without further processing does not give the desired results. Therefore, the ASiT-CRNN model applies frequency-averaged pooling (FAP) and nearest neighbour interpolation (NNI) to the ASiT output, on top of the original network architecture, to obtain a sequence of frame-level features and to improve the temporal resolution of the embedding. The complementary ASiT and CNN feature sequences are then fused to obtain a higher-quality, more discriminative representation of audio features. We train ASiT-CRNN on the development set of DCASE 2022 Task 4 and fine-tune it in two phases using the semi-supervised mean-teacher method to address the challenge of limited labelled data. Finally, we also provide a fair comparison of ASiT with several other self-supervised pre-trained models on the SED task. ASiT-CRNN achieves PSDS1 and PSDS2 scores of 0.488 and 0.767, respectively, significantly outperforming the CRNN baseline scores of 0.351 and 0.552. In addition, ASiT-CRNN outperforms several other SSL pre-trained models in the SED comparison experiments. Source code is available at https://github.com/qingkezyy/ASiT-CRNN.
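The FAP and NNI steps described in the abstract can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the authors' released implementation: the patch-grid dimensions, embedding size, target frame count, and the helper name `tokens_to_frames` are all hypothetical, chosen only to show how frequency-averaged pooling over a ViT-style patch grid followed by nearest-neighbour upsampling along time could yield a frame-level feature sequence.

```python
# Minimal sketch (not the authors' code) of converting ViT-style patch tokens
# into a frame-level feature sequence via frequency-averaged pooling (FAP)
# and nearest neighbour interpolation (NNI). Shapes are illustrative.
import torch
import torch.nn.functional as F


def tokens_to_frames(tokens, n_freq_patches, n_time_patches, n_frames):
    """tokens: (batch, n_freq_patches * n_time_patches, dim) patch embeddings."""
    b, _, d = tokens.shape
    # Restore the 2-D patch grid: (batch, dim, freq_patches, time_patches).
    grid = tokens.transpose(1, 2).reshape(b, d, n_freq_patches, n_time_patches)
    # FAP: average over the frequency-patch axis -> (batch, dim, time_patches).
    pooled = grid.mean(dim=2)
    # NNI: upsample the time axis to the desired frame resolution.
    frames = F.interpolate(pooled, size=n_frames, mode="nearest")
    return frames.transpose(1, 2)  # (batch, n_frames, dim)


# Example with an assumed 8x62 patch grid, 768-dim tokens, 156 output frames.
x = torch.randn(2, 8 * 62, 768)
print(tokens_to_frames(x, 8, 62, 156).shape)  # torch.Size([2, 156, 768])
```

Under these assumptions, the resulting frame-level sequence has the temporal resolution expected by the CRNN branch, so the two feature streams can be fused frame by frame.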