Masked Modeling-based Audio Representation for ACM Multimedia 2022 Computational Paralinguistics ChallengE
Abstract: In this paper, we present our solution for the ACM Multimedia 2022 Computational Paralinguistics Challenge. Our method employs the self-supervised learning paradigm, which has achieved promising results in computer vision and audio signal processing. Specifically, we first explore modifying the Swin Transformer architecture to learn a general representation for audio signals, accompanied by random masking of the log-mel spectrogram. The main goal of the pretext task is to predict the masked parts, combining the advantages of the Swin Transformer and masked modeling. For the downstream tasks, we fine-tune the pre-trained model on the labelled datasets. Compared with the competitive baselines, our approach provides significant performance improvements without ensembling.
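The masking pretext task described above can be sketched as follows: hide random patches of a log-mel spectrogram and keep the hidden values as reconstruction targets. This is a minimal illustrative sketch with NumPy; the patch size, mask ratio, and function name are assumptions for illustration, not the paper's actual configuration.

```python
import numpy as np

def random_patch_mask(spec, patch=(16, 16), mask_ratio=0.75, seed=0):
    """Randomly mask non-overlapping patches of a log-mel spectrogram.

    Returns the masked spectrogram, the boolean patch-grid mask, and the
    original patch values (reconstruction targets for the pretext task).
    Patch size and 75% ratio are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    F, T = spec.shape
    pf, pt = patch
    nf, nt = F // pf, T // pt                 # patch grid dimensions
    n_patches = nf * nt
    n_mask = int(round(mask_ratio * n_patches))

    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.permutation(n_patches)[:n_mask]] = True

    masked = spec.copy()
    targets = []
    for i in np.flatnonzero(mask):
        r, c = divmod(i, nt)
        sl = np.s_[r * pf:(r + 1) * pf, c * pt:(c + 1) * pt]
        targets.append(spec[sl].copy())       # keep hidden values as targets
        masked[sl] = 0.0                      # zero out the masked patch
    return masked, mask.reshape(nf, nt), np.stack(targets)

# Example: a synthetic 128-mel x 256-frame spectrogram
spec = np.random.randn(128, 256).astype(np.float32)
masked, mask, targets = random_patch_mask(spec, mask_ratio=0.75)
```

During pre-training, an encoder-decoder (here, the modified Swin Transformer) would be trained to regress `targets` from `masked`, typically with a mean-squared-error loss on the masked positions only.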