AudioSAM: Automated Annotation of Bioacoustic Soundscapes in the Wild

Published: 02 Oct 2025 · Last Modified: 02 Dec 2025 · NeurIPS 2025 AiForAnimalComms Workshop · CC BY 4.0
Keywords: bioacoustics, audio, auto-encoder, dataset
Abstract: Automated analysis of bioacoustic recordings is essential for monitoring biodiversity and ecosystem health, yet current methods struggle with the complexity of natural soundscapes and the scarcity of labeled data. We introduce a bioacoustic Masked Autoencoder (a self-supervised framework) designed to learn robust audio representations from large-scale, unlabeled recordings. Pretrained on over 15,000 hours of diverse terrestrial and marine audio, our model—a 1B-parameter Vision Transformer encoder paired with a 500M-parameter decoder—learns representations that generalize across species and habitats. When evaluated on multiple bioacoustic benchmarks, our model achieves state-of-the-art performance among foundation models in both vocalization detection and species classification tasks. We further demonstrate the benefits of combining supervised and unsupervised contrastive objectives for species-aware embeddings. Our contributions include: (1) a large-scale unified dataset of bioacoustic recordings, (2) a pretrained foundation model for bioacoustic analysis (which we call AudioSAM), and (3) evidence that self-supervised learning enables scalable, label-efficient monitoring of global biodiversity.
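The masked-autoencoder pretraining the abstract describes hinges on one simple operation: hiding most of the input patches so the encoder only sees a small visible subset, while the decoder must reconstruct the rest. Below is a minimal sketch of that patch-masking step, assuming a spectrogram already split into patch embeddings; the function name, mask ratio, and shapes are illustrative and not taken from the paper.

```python
import numpy as np

def random_mask(patches, mask_ratio=0.75, rng=None):
    """Randomly hide a fraction of patches, MAE-style.

    patches: array of shape (num_patches, dim).
    Returns the visible patches plus index arrays so a decoder
    could later reinsert mask tokens and reconstruct the input.
    """
    rng = rng or np.random.default_rng(0)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])   # patches the encoder sees
    mask_idx = np.sort(perm[n_keep:])   # patches the decoder must predict
    return patches[keep_idx], keep_idx, mask_idx

# A toy "spectrogram" of 64 patch embeddings, each 32-dimensional.
patches = np.random.default_rng(1).normal(size=(64, 32))
visible, keep_idx, mask_idx = random_mask(patches)
print(visible.shape)  # (16, 32): only 25% of patches reach the encoder
```

With a 75% mask ratio the encoder processes only a quarter of the sequence, which is what makes pretraining a large encoder on thousands of hours of audio tractable.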
Submission Number: 38