Improving Acoustic Scene Classification via Self-Supervised and Semi-Supervised Learning with Efficient Audio Transformer
Abstract: In response to the abundance of unlabeled acoustic scene data in the real world and the domain differences across acoustic scenes, the ICME 2024 Grand Challenge introduced the task of “Semi-supervised Acoustic Scene Classification under Domain Shift.” To tackle this task, we propose a multi-stage semi-supervised framework that builds on the self-supervised learning (SSL) model Efficient Audio Transformer (EAT) and a self-training fine-tuning method. The framework first applies self-supervised learning to a wealth of unlabeled acoustic scene data, learning to extract general audio representations. It then performs semi-supervised fine-tuning with pseudo-labels and applies a test-time adaptation strategy to improve inference. Our approach achieved a macro-accuracy of 0.752 across ten categories on the final evaluation dataset, ranking second, only 0.006 below the first-place system.
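As a rough illustration of the pseudo-label fine-tuning stage the abstract describes, the following PyTorch sketch pairs a frozen SSL encoder with a trainable linear head and combines a supervised loss with a confidence-thresholded pseudo-label loss on unlabeled audio. This is a minimal sketch, not the authors' implementation: the SceneClassifier wrapper, the 0.9 confidence threshold, and the toy encoder stand in for the pretrained EAT backbone and the challenge data.

```python
# Minimal sketch of confidence-thresholded pseudo-label fine-tuning.
# NOT the authors' code: the encoder stub, the 0.9 threshold, and the
# tensor shapes are illustrative assumptions replacing EAT and real audio.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 10      # ten acoustic scene categories (per the abstract)
CONF_THRESHOLD = 0.9  # assumed cutoff for accepting a pseudo-label


class SceneClassifier(nn.Module):
    """Frozen SSL encoder (stand-in for EAT) plus a trainable linear head."""

    def __init__(self, encoder: nn.Module, embed_dim: int):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(embed_dim, NUM_CLASSES)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # keep the pretrained representation fixed
            feats = self.encoder(x)
        return self.head(feats)


def pseudo_label_step(model, optimizer, labeled, unlabeled):
    """One semi-supervised update: supervised loss + thresholded pseudo-labels."""
    x_l, y_l = labeled
    x_u = unlabeled

    # Supervised term on the small labeled set.
    loss = F.cross_entropy(model(x_l), y_l)

    # Pseudo-labels: predict on unlabeled audio, keep only confident examples.
    with torch.no_grad():
        probs = F.softmax(model(x_u), dim=-1)
        conf, pseudo_y = probs.max(dim=-1)
        mask = conf >= CONF_THRESHOLD
    if mask.any():
        loss = loss + F.cross_entropy(model(x_u[mask]), pseudo_y[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    # Toy stand-ins: a linear "encoder" and random spectrogram-like features.
    encoder = nn.Linear(128, 64)
    model = SceneClassifier(encoder, embed_dim=64)
    optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)

    labeled = (torch.randn(8, 128), torch.randint(0, NUM_CLASSES, (8,)))
    unlabeled = torch.randn(32, 128)
    print("loss:", pseudo_label_step(model, optimizer, labeled, unlabeled))
```

In a multi-stage setup such as the one summarized above, a step like this would typically be iterated, regenerating pseudo-labels as the head improves; the test-time adaptation stage is a separate mechanism not shown here.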