A Deep Residual Network for Large-Scale Acoustic Scene Analysis

Logan Ford, Hao Tang, François Grondin, James R. Glass

Published: 2019 | Last Modified: 28 Apr 2023 | INTERSPEECH 2019 | Readers: Everyone

Abstract: Many recent advances in audio event detection, particularly on the AudioSet data set, have focused on improving performance using the released embeddings produced by a pre-trained model. In this work, we instead study the task of training a multi-label event classifier directly from the audio recordings of AudioSet. Using the audio recordings, we not only reproduce results from prior work but also confirm the improvements from other proposed additions, such as an attention module. Moreover, by training the embedding network jointly with these additions, we achieve a mean average precision (mAP) of 0.392 and an AUC of 0.971, surpassing the state of the art without transfer learning from a large data set. We also analyze the output activations of the network and find that the models are able to localize audio events when a finer time resolution is needed.
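For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how mAP and AUC might be computed for a multi-label classifier. It assumes the common convention of computing average precision and ROC AUC per class and then averaging across classes; the abstract does not state the exact evaluation procedure, and the function names `average_precision`, `roc_auc`, and `macro_metrics` are hypothetical helpers written for illustration.

```python
import numpy as np


def average_precision(y_true, y_score):
    """AP for one class: mean of the precision at the rank of each positive."""
    order = np.argsort(-y_score, kind="stable")  # sort clips by descending score
    hits = y_true[order]
    n_pos = hits.sum()
    if n_pos == 0:
        return float("nan")  # AP is undefined for a class with no positives
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / n_pos)


def roc_auc(y_true, y_score):
    """ROC AUC for one class: chance a positive outranks a negative (ties count 0.5)."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # AUC needs at least one positive and one negative
    diff = pos[:, None] - neg[None, :]  # all positive/negative score pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def macro_metrics(Y_true, Y_score):
    """Assumed convention: average per-class AP and AUC, skipping undefined classes."""
    n_classes = Y_true.shape[1]
    aps = [average_precision(Y_true[:, c], Y_score[:, c]) for c in range(n_classes)]
    aucs = [roc_auc(Y_true[:, c], Y_score[:, c]) for c in range(n_classes)]
    return float(np.nanmean(aps)), float(np.nanmean(aucs))
```

Here `Y_true` is a binary array of shape (clips, classes) and `Y_score` holds the matching classifier outputs. The pairwise AUC above uses memory proportional to positives times negatives per class, so it suits small illustrations rather than a full evaluation set.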
