Supervised Representation Learning for Audio Scene Classification

Alain Rakotomamonjy

Published: 2017, Last Modified: 12 May 2023, IEEE ACM Trans. Audio Speech Lang. Process. 2017, Readers: Everyone

Abstract: This paper investigates the use of supervised feature learning approaches for extracting relevant and discriminative features from acoustic scene recordings. Owing to the recent release of open datasets for acoustic scene classification problems, representation learning techniques can now be envisioned for solving the problem of feature extraction. This paper takes a step toward this goal by first introducing a supervised nonnegative matrix factorization (SNMF). Our goal with this SNMF is to induce the matrix decomposition to carry discriminative information in addition to the usual generative information. We achieve this objective by augmenting the nonnegative matrix factorization optimization problem with a novel loss function related to the class label of each column of the matrix to decompose. Although the available datasets are still small compared to those available in computer vision, we have also studied models based on convolutional neural networks. We have analyzed the performance of these models on the DCASE-16 dataset and a corrected version of the LITIS Rouen dataset. Our experiments show that, despite the small-scale setting, supervised feature learning is competitive with the current state-of-the-art features. We also point out that for smaller-scale datasets, SNMF is indeed slightly less prone to overfitting than convolutional neural networks. While the performance of these learned features is interesting per se, a deeper analysis of their behavior in the acoustic scene problem context raises open and difficult questions that, we believe, need to be addressed for further performance breakthroughs.
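
To make the idea of "augmenting the nonnegative matrix factorization optimization problem with a loss related to class labels" more concrete, here is a minimal sketch of one possible formulation. It is not the paper's algorithm: the abstract does not specify the loss or the solver, so the softmax cross-entropy term, the linear classifier `C`, the projected gradient updates, the step size, and every function and variable name below are assumptions chosen for illustration. The data are random and synthetic.

```python
import numpy as np


def softmax_columns(Z):
    """Column-wise softmax, shifted for numerical stability."""
    Z = Z - Z.max(axis=0, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)


def objective(X, Y, W, H, C, lam):
    """Reconstruction error plus lam times a label loss on the codes H.

    ASSUMPTION: the label loss is softmax cross-entropy of a linear
    classifier C applied to each column of H. The abstract only says
    "a loss related to class labels", so this is one illustrative choice.
    """
    recon = 0.5 * np.sum((X - W @ H) ** 2)
    P = softmax_columns(C @ H)
    cross_entropy = -np.sum(Y * np.log(P + 1e-12))
    return recon + lam * cross_entropy


def supervised_nmf_sketch(X, labels, k, lam=1.0, lr=1e-3, n_iter=200, seed=0):
    """Projected gradient descent on the joint objective (hypothetical solver).

    X      : (d, n) nonnegative matrix, one column per example
    labels : (n,) integer class label of each column
    k      : number of dictionary atoms
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    n_classes = int(labels.max()) + 1
    Y = np.eye(n_classes)[:, labels]      # one-hot labels, shape (n_classes, n)

    W = rng.random((d, k))                # nonnegative dictionary
    H = rng.random((k, n))                # nonnegative codes
    C = np.zeros((n_classes, k))          # classifier weights (unconstrained)

    history = [objective(X, Y, W, H, C, lam)]
    for _ in range(n_iter):
        R = W @ H - X                     # reconstruction residual
        G = softmax_columns(C @ H) - Y    # gradient of cross-entropy w.r.t. logits

        grad_W = R @ H.T
        grad_H = W.T @ R + lam * (C.T @ G)
        grad_C = lam * (G @ H.T)

        # Gradient step, then projection onto the nonnegative orthant for W and H.
        W = np.maximum(W - lr * grad_W, 0.0)
        H = np.maximum(H - lr * grad_H, 0.0)
        C = C - lr * grad_C

        history.append(objective(X, Y, W, H, C, lam))
    return W, H, C, history


# Synthetic usage example (random data, not an acoustic dataset).
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.random((20, 60))
labels_demo = demo_rng.integers(0, 3, size=60)
W_demo, H_demo, C_demo, history_demo = supervised_nmf_sketch(X_demo, labels_demo, k=5)
```

In this sketch, `lam` trades off reconstruction against label fit: with `lam=0` the updates for `W` and `H` reduce to plain projected-gradient NMF, while larger values push the codes in `H` to be linearly separable by class. In an acoustic scene setting, the columns of `X_demo` would presumably be replaced by nonnegative time-frequency features, which is again an assumption rather than something the abstract states.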
