Spatial-Temporal-Class Attention Network for Acoustic Scene Classification

Published: 01 Jan 2022, Last Modified: 15 May 2023, ICME 2022
Abstract: Acoustic scene classification, in which a scene is identified from a sound recording, is a difficult problem that is much less studied than similar problems in computer vision. Recent advances in attention-based convolutional neural networks (CNNs) can be applied to audio data by operating on two-dimensional spectrograms, in which frequency and time information have been separated, rather than on the raw audio signal. Typical CNNs have difficulty with this problem due to the temporal aspects of acoustic data. In this research we propose a novel and intuitive CNN-based architecture with attention mechanisms called the spatial-temporal-class attention network (STCANet). The STCANet consists of a spatial-temporal attention and a class attention, which extract information along the frequency, time, and class dimensions of spectrograms. In our experiments, the STCANet achieved 75.6%, 95.4%, and 97.0% accuracy on the TUT 2018, TAU 2020, and ESC-10 datasets, which is competitive with previous works. Our contributions include this novel network design and a detailed analysis of how attention enables these results.
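The abstract does not specify the layer-level design of the two attention components, so the following is only a minimal PyTorch sketch of the general idea: one module that reweights a CNN feature map along its frequency and time axes, and one that pools the map with a separate attention weighting per class. The module names, tensor shapes, and the 1x1-convolution attention formulation are all assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialTemporalAttention(nn.Module):
    """Reweight a spectrogram feature map (batch, channels, freq, time)
    along its frequency (spatial) and time axes. Sketch only; the
    paper's actual formulation may differ."""
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolutions produce per-position attention logits
        self.freq_att = nn.Conv2d(channels, 1, kernel_size=1)
        self.time_att = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, F, T)
        # frequency attention: average over time, softmax over freq bins
        f = torch.softmax(self.freq_att(x).mean(dim=3, keepdim=True), dim=2)
        # temporal attention: average over frequency, softmax over frames
        t = torch.softmax(self.time_att(x).mean(dim=2, keepdim=True), dim=3)
        return x * f * t  # broadcast-reweighted feature map

class ClassAttention(nn.Module):
    """Per-class attention pooling: each class attends to the feature
    map and produces its own logit (hypothetical formulation)."""
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.att = nn.Conv2d(channels, num_classes, kernel_size=1)
        self.cls = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, F, T)
        w = torch.softmax(self.att(x).flatten(2), dim=-1)  # (B, K, F*T)
        s = self.cls(x).flatten(2)                         # (B, K, F*T)
        return (w * s).sum(dim=-1)                         # (B, K) logits

# Usage on a hypothetical 64-channel feature map of a 128x431 spectrogram:
# x = torch.randn(8, 64, 128, 431)
# logits = ClassAttention(64, 10)(SpatialTemporalAttention(64)(x))
```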