Keywords: Sound Crowd Count, Dyadic Decomposition Network, Learnable Filters, Acoustic Crowd Counting
Abstract: In this paper, we study an underexplored yet important and challenging problem: counting the number of distinct sound events in data characterized by a high degree of polyphonicity and spectral overlap. A key example is counting individual bird calls in bioacoustic data, from which biodiversity can be estimated. We address it by proposing a novel end-to-end trainable neural network, designing new evaluation protocols, quantifying how counting difficulty depends on sound polyphonicity, and creating a new dataset tailored for concurrent sound event counting. Unlike existing methods, which all apply frequency-selective filters to the raw waveform in a single stage, our neural network progressively decomposes the raw waveform dyadically in the frequency domain. Taking inspiration from wavelet decomposition, intermediate waveforms convolved by a parent filter are successively processed by a pair of child filters that evenly split the parent filter's frequency response. An energy gain normalization module is introduced to normalize the loudness variance and spectrum overlap of the received sound events. The network is fully convolutional and parameter-frugal, so it is lightweight and computationally efficient. We further design a set of polyphony-aware metrics to quantify the difficulty of sound counting from different perspectives. To show the efficiency and generalization of our method, which we call DyDecNet, we run experiments on bioacoustic bird sound (both synthetic and real-world), telephone-ring sound, and music sound data. Comprehensive experimental results show that our method significantly outperforms existing sound event detection (SED) methods. The dyadic decomposition front-end network can also be used by existing methods to improve their performance.
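The dyadic splitting described in the abstract can be illustrated with a minimal sketch. This is not the DyDecNet implementation: it assumes that "evenly split" means halving each parent band in linear frequency, and it replaces the learnable filters with fixed brick-wall FFT masks. The names `dyadic_bands`, `bandpass_fft`, and `decompose` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def dyadic_bands(f_lo, f_hi, depth):
    """Return, per level, the (low, high) bands obtained by repeatedly
    halving every parent band into two child bands."""
    levels = [[(f_lo, f_hi)]]
    for _ in range(depth):
        children = []
        for lo, hi in levels[-1]:
            mid = 0.5 * (lo + hi)
            children += [(lo, mid), (mid, hi)]
        levels.append(children)
    return levels

def bandpass_fft(x, sr, lo, hi):
    """Ideal band-pass via FFT masking (assumption: a fixed stand-in
    for the learnable filters mentioned in the abstract)."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    mask = (freqs >= lo) & (freqs < hi)
    return np.fft.irfft(spec * mask, n=len(x))

def decompose(x, sr, depth):
    """Progressive decomposition: each child filters its parent's
    output rather than the raw waveform."""
    nodes = [(0.0, sr / 2.0, x)]
    for _ in range(depth):
        nxt = []
        for lo, hi, sig in nodes:
            mid = 0.5 * (lo + hi)
            nxt.append((lo, mid, bandpass_fft(sig, sr, lo, mid)))
            nxt.append((mid, hi, bandpass_fft(sig, sr, mid, hi)))
        nodes = nxt
    return nodes

# Two overlapping tones: 300 Hz and 2500 Hz, one second at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 2500 * t)

leaves = decompose(x, sr, depth=2)  # four 1 kHz-wide leaf bands
energies = [float(np.sum(sig ** 2)) for _, _, sig in leaves]
```

With two levels, the 0 to 4000 Hz range ends up in four leaf bands, and the two tones land in separate leaves (the first and third), which is the kind of separation a counting head could then operate on. In the actual network the filters are learned, so the band edges need not be this sharp.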
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
TL;DR: A novel and general framework for sound crowd counting from raw sound waveforms
Supplementary Material: zip
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/soundcount-sound-counting-from-raw-audio-with/code)