Audio Scene Classification with Discriminatively-Trained Segment-Level Features

Haichuan Bai, Hangting Chen, Yonghong Yan

Published: 2019, Last Modified: 15 May 2025, ICME Workshops 2019, CC BY-SA 4.0

Abstract: In this paper, we investigate a novel method that automatically classifies live audio recordings into predefined acoustic scenes by making use of discriminatively-trained segment-level features. Long-term characteristics of the acoustic scenes are captured by a feed-forward time-delay deep neural network with a temporal pooling layer that aggregates over the whole audio segment. The discriminatively-trained segment-level audio features derived from this network are concatenated to the frame-level features and fed into a DNN-based back-end classifier, after which a post-processing mechanism is applied. Experimental results demonstrate that the proposed system with the discriminatively-trained segment-level features achieves a classification accuracy of 75.68%, an absolute improvement of 13.64% over the reference systems that use only the frame-level features.
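
The pooling and concatenation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the pooling statistics, so mean and standard deviation are assumed here, and the function names (`stats_pool`, `append_segment_features`), the feature dimensions, and the random stand-in data are all hypothetical.

```python
import numpy as np

def stats_pool(frame_acts):
    """Aggregate frame-level activations of shape (T, D) into one
    segment-level vector of shape (2 * D,).
    Assumption: pooling uses mean and standard deviation over time."""
    mean = frame_acts.mean(axis=0)
    std = frame_acts.std(axis=0)
    return np.concatenate([mean, std])

def append_segment_features(frame_feats, segment_vec):
    """Repeat the segment-level vector for every frame and concatenate it
    to the frame-level features, giving shape (T, F + len(segment_vec))."""
    num_frames = frame_feats.shape[0]
    tiled = np.tile(segment_vec, (num_frames, 1))
    return np.concatenate([frame_feats, tiled], axis=1)

rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(500, 40))  # hypothetical: 500 frames, 40-dim features
hidden = rng.normal(size=(500, 64))       # stand-in for time-delay network activations

segment_vec = stats_pool(hidden)                            # shape (128,)
combined = append_segment_features(frame_feats, segment_vec)
print(combined.shape)  # (500, 168)
```

In this sketch every frame of a recording receives the same segment-level vector, so a back-end classifier that operates on individual frames would still see information summarizing the whole segment.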