Audio Pyramid Transformer with Domain Adaption for Weakly Supervised Sound Event Detection and Audio Classification

Published: 01 Jan 2022 · Last Modified: 13 Nov 2024 · INTERSPEECH 2022 · License: CC BY-SA 4.0
Abstract: Transformer-based models have recently been applied to sound event detection and audio classification. However, when processing audio spectrograms at a fine-grained scale, the computational cost remains high even with a hierarchical structure. In this paper, we introduce APT, an audio pyramid transformer with quadtree attention, which reduces the attention complexity from quadratic to linear. In addition, most previous methods for weakly supervised sound event detection (WSSED) rely on the multi-instance learning (MIL) mechanism. However, MIL optimizes the accuracy of bags (clips) rather than instances (frames), so it tends to localize only the most distinctive part of a sound event rather than the whole event. To address this problem, we offer a novel perspective that models WSSED as a domain adaptation (DA) task, in which the weights of a classifier trained on the source (clip) domain are shared with the target (frame) domain to improve localization. Furthermore, we introduce a domain adaptation detection (DAD) loss that aligns the feature distributions of the frame and clip domains so that the classifier better perceives frame-domain information. Experiments show that APT achieves new state-of-the-art (SOTA) results on the AudioSet, DCASE2017, and Urban-SED datasets, and that our DA-WSSED pipeline significantly outperforms MIL-based WSSED methods.
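The core idea of the DA-WSSED pipeline can be illustrated with a minimal numpy sketch: a single classifier, notionally trained with clip-level (weak) labels, is applied unchanged to every frame for localization, and an alignment term penalizes the gap between clip- and frame-domain feature statistics. This is not the authors' implementation; the backbone features are random stand-ins for APT outputs, attention pooling is one common choice for the clip embedding, and the simple moment-matching term below is only a stand-in for the paper's DAD loss.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, C = 100, 64, 10  # frames per clip, feature dim, event classes

# Frame-level features (stand-in for the APT backbone's outputs).
frames = rng.standard_normal((T, D))

# One shared classifier: trained on the clip (source) domain,
# then reused as-is on the frame (target) domain.
W = rng.standard_normal((D, C)) * 0.01
b = np.zeros(C)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Clip-domain path: attention-pool frames into one clip embedding,
# then classify (this is where the weak clip labels would supervise).
att_logits = frames @ rng.standard_normal(D)
att = np.exp(att_logits - att_logits.max())
att /= att.sum()
clip_feat = att @ frames                      # shape (D,)
clip_prob = sigmoid(clip_feat @ W + b)        # shape (C,)

# Frame-domain path: the SAME W, b score every frame (localization).
frame_prob = sigmoid(frames @ W + b)          # shape (T, C)

# Moment-matching stand-in for the DAD alignment loss: pull the
# clip-domain embedding toward the frame-domain feature mean.
dad_loss = np.mean((frames.mean(axis=0) - clip_feat) ** 2)
```

Because the classifier weights are shared across domains, improving frame-domain alignment directly improves frame-level (instance) predictions, rather than only the bag-level accuracy that MIL optimizes.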
