Weakly-supervised audio event detection using event-specific Gaussian filters and fully convolutional networks

Ting-Wei Su, Jen-Yu Liu, Yi-Hsuan Yang

Published: 2017, Last Modified: 28 Jul 2025, ICASSP 2017, CC BY-SA 4.0

Abstract: Audio event detection aims at discovering the elements inside an audio clip. In addition to labeling the clips with the audio events, we want to find out the temporal locations of these events. However, creating clearly annotated training data can be time-consuming. Therefore, we propose a model based on convolutional neural networks that relies only on weakly-supervised data for training. These data can be directly obtained from online platforms, such as Freesound, with the clip-level labels assigned by the uploaders. The structure of our model is extended to a fully convolutional network, and an event-specific Gaussian filter layer is designed to improve its learning ability. Moreover, this model is able to detect frame-level information, e.g., the temporal position of sounds, even when it is trained merely with clip-level labels.
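
To illustrate how frame-level information can coexist with clip-level labels, the following is a minimal sketch and not the model described in the abstract. It assumes that a network has already produced one score per frame for each event, that the event-specific Gaussian filter acts as a 1-D smoothing kernel along time with its own width per event, and that clip-level predictions come from mean pooling over frames. The abstract specifies none of these details. The event names, the sigma values, and the helper functions `gaussian_kernel`, `smooth`, and `detect` are all hypothetical.

```python
import math


def gaussian_kernel(sigma, radius):
    """Return a normalized 1-D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(t * t) / (2.0 * sigma * sigma))
               for t in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(scores, kernel):
    """Filter a frame-level score sequence along time (same output length).

    Frames outside the clip are treated as zero.
    """
    radius = len(kernel) // 2
    n = len(scores)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < n:
                acc += w * scores[j]
        out.append(acc)
    return out


def detect(frame_scores, sigmas, radius=4):
    """Apply one Gaussian filter per event, then pool over time.

    frame_scores: dict mapping event name -> list of per-frame scores.
    sigmas: dict mapping event name -> kernel width in frames
            (fixed by hand here; an assumption made for illustration).
    Returns (frame_level, clip_level).
    """
    frame_level = {}
    clip_level = {}
    for event, scores in frame_scores.items():
        kernel = gaussian_kernel(sigmas[event], radius)
        smoothed = smooth(scores, kernel)
        frame_level[event] = smoothed
        # Mean pooling is an assumed choice; only this clip-level value
        # would be compared with the uploader's clip-level label.
        clip_level[event] = sum(smoothed) / len(smoothed)
    return frame_level, clip_level


# Hypothetical per-frame scores for an 11-frame clip.
frame_scores = {
    "dog_bark": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],  # short, localized event
    "rain": [0.5] * 11,                             # long, stationary event
}
sigmas = {"dog_bark": 1.0, "rain": 3.0}

frame_level, clip_level = detect(frame_scores, sigmas)
# Index of the frame with the highest smoothed score for the short event.
peak_frame = max(range(11), key=lambda i: frame_level["dog_bark"][i])
```

In this sketch only `clip_level` would enter a training loss, while `frame_level` stays available afterwards and can be read as a rough estimate of where each event occurs in time. A narrow kernel suits short events and a wide kernel suits long ones, which is one plausible motivation for making the filter event-specific.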