Keywords: bioacoustics, audio, auto-encoder, dataset
Abstract: Bioacoustic monitoring is vital for ecological research and wildlife conservation, enabling the tracking of species and ecosystem health without physical intervention. However, automated analysis of bioacoustic data is challenging due to complex soundscapes with diverse species and background noise. This paper introduces the Bioacoustic Masked Autoencoder (Bioacoustic MAE), a self-supervised learning approach tailored to bioacoustic data. We adapt the Masked Autoencoder framework to animal vocalizations by incorporating specialized patching strategies and a spectral loss function that captures the temporal and harmonic features critical for species identification. Trained on a diverse dataset of terrestrial and marine bioacoustic recordings, our model learns robust audio representations from large amounts of unlabeled data, significantly outperforming traditional methods in species identification, vocalization detection, and acoustic event classification. We present three main contributions: (1) the release of a 1-billion-parameter Vision Transformer (ViT) encoder model, trained on bioacoustic data, capable of handling long audio sequences; (2) a set of unified annotations across dozens of bioacoustic datasets; and (3) evidence of zero-shot transfer, demonstrating the model’s ability to generalize across unseen species, habitats, and recording conditions. Our approach enables scalable, efficient bioacoustic analysis and advances the field by providing a label-free method to study animal vocalizations across diverse environments.
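To make the masked-autoencoding idea concrete, below is a minimal sketch of masked reconstruction over spectrogram patches. It is an illustrative assumption, not the authors' released code: the class name `MaskedSpectrogramAE`, the patch sizes, the BERT-style mask tokens, and the plain MSE on masked log-mel patches (a simple stand-in for the paper's spectral loss) are all choices made here for brevity.

```python
import torch
import torch.nn as nn

class MaskedSpectrogramAE(nn.Module):
    """Minimal masked autoencoder over log-mel spectrogram patches (a sketch)."""
    def __init__(self, patch_f=16, patch_t=16, dim=256, depth=4,
                 mask_ratio=0.75, max_patches=512):
        super().__init__()
        self.patch_f, self.patch_t = patch_f, patch_t
        self.mask_ratio = mask_ratio
        patch_dim = patch_f * patch_t
        self.embed = nn.Linear(patch_dim, dim)
        # Learnable token substituted for masked patches, plus positions.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.randn(1, max_patches, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, patch_dim)  # reconstruct patch pixels

    def patchify(self, spec):
        # spec: (B, F, T) log-mel spectrogram -> (B, N, patch_f * patch_t)
        B, F, T = spec.shape
        x = spec.reshape(B, F // self.patch_f, self.patch_f,
                         T // self.patch_t, self.patch_t)
        return x.permute(0, 1, 3, 2, 4).reshape(B, -1, self.patch_f * self.patch_t)

    def forward(self, spec):
        patches = self.patchify(spec)
        tokens = self.embed(patches)                       # (B, N, dim)
        B, N, _ = tokens.shape
        # Randomly mask a fraction of patches and replace them with the mask token.
        mask = torch.rand(B, N, device=spec.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, N, -1), tokens)
        recon = self.head(self.encoder(tokens + self.pos[:, :N]))
        # Loss on masked patches only; MSE on log-mel pixels stands in for the
        # spectral loss described in the abstract.
        return ((recon - patches) ** 2)[mask].mean()

# Usage: one self-supervised training step on a random batch.
model = MaskedSpectrogramAE()
spec = torch.randn(2, 128, 256)  # (batch, mel bins, time frames)
loss = model(spec)
loss.backward()
```

Computing the reconstruction loss only on masked patches is what forces the encoder to infer missing time-frequency structure from context, which is the property the abstract attributes to the learned representations.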
Submission Number: 38