Self-Supervised Learning-Based General Fine-tuning Framework For Audio Classification and Event Detection

Yanjie Sun, Kele Xu, Yong Dou, Tian Gao

Published: 2024, Last Modified: 14 May 2025, Venue: ICME 2024, License: CC BY-SA 4.0

Abstract: Recently, self-supervised learning (SSL) has made remarkable progress in signal representation and has become a de facto solution for different audio processing tasks. SSL generally consists of two phases: foundation pre-training and downstream fine-tuning. However, fine-tuning frameworks may lack universality because audio signal processing tasks employ distinct learning paradigms and model designs. Furthermore, the degree of dataset labeling varies across tasks, which makes it difficult to unify them under a single fine-tuning framework. To address these issues, we propose vec2task, a cross-task general fine-tuning framework based on an SSL pre-trained model. It employs a semantic-aware module and an alternating training strategy, enabling the framework to generalize across various audio signal processing tasks. Additionally, the framework applies automatic audio augmentation strategies, eliminating the need for individually tailored algorithms to improve task performance. In experiments, vec2task outperformed previous methods on audio classification and event detection tasks, demonstrating its ability to generalize across tasks.