Keywords: bioacoustics, audio, auto-encoder, dataset
Abstract: Automated analysis of bioacoustic recordings is essential for monitoring biodiversity and ecosystem health, yet current methods struggle with the complexity of natural soundscapes and the scarcity of labeled data. We introduce a bioacoustic Masked Autoencoder, a self-supervised framework designed to learn robust audio representations from large-scale, unlabeled recordings. Pretrained on over 15,000 hours of diverse terrestrial and marine audio, our model, a 1B-parameter Vision Transformer encoder paired with a 500M-parameter decoder, learns representations that generalize across species and habitats. Evaluated on multiple bioacoustic benchmarks, it achieves state-of-the-art performance among foundation models on both vocalization detection and species classification. We further demonstrate the benefits of combining supervised and unsupervised contrastive objectives for species-aware embeddings. Our contributions are: (1) a large-scale unified dataset of bioacoustic recordings, (2) a pretrained foundation model for bioacoustic analysis (which we call AudioSAM), and (3) evidence that self-supervised learning enables scalable, label-efficient monitoring of global biodiversity.
Submission Number: 38
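For concreteness, below is a minimal sketch of masked-autoencoder pretraining on spectrogram patches, the technique the abstract describes, written in PyTorch. The class name `SpectrogramMAE`, all layer sizes, the 0.75 mask ratio, and the mean-squared reconstruction loss on masked patches are illustrative assumptions, not the paper's reported configuration (the paper's model uses a 1B-parameter encoder and 500M-parameter decoder).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectrogramMAE(nn.Module):
    """Masked autoencoder over flattened spectrogram patches (illustrative sizes)."""
    def __init__(self, patch_dim=256, embed_dim=768, depth=12,
                 dec_dim=512, dec_depth=4, num_patches=512, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=12, batch_first=True), depth)
        self.enc_to_dec = nn.Linear(embed_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.dec_pos = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True), dec_depth)
        self.head = nn.Linear(dec_dim, patch_dim)  # reconstruct raw patch values

    def forward(self, patches):
        # patches: (B, N, patch_dim) flattened time-frequency patches
        B, N, _ = patches.shape
        x = self.patch_embed(patches) + self.pos_embed[:, :N]
        # Random per-example permutation; keep only (1 - mask_ratio) of the patches.
        n_keep = int(N * (1 - self.mask_ratio))
        perm = torch.rand(B, N, device=x.device).argsort(dim=1)
        gather = lambda t, i: torch.gather(t, 1, i.unsqueeze(-1).expand(-1, -1, t.size(-1)))
        latent = self.encoder(gather(x, perm[:, :n_keep]))  # encoder sees visible patches only
        # Decoder input: encoded visible tokens, then mask tokens for the hidden
        # slots, each given the positional embedding of the slot it reconstructs.
        dec_in = torch.cat([self.enc_to_dec(latent),
                            self.mask_token.expand(B, N - n_keep, -1)], dim=1)
        dec_in = dec_in + gather(self.dec_pos.expand(B, -1, -1), perm)
        pred = self.head(self.decoder(dec_in))[:, n_keep:]
        # Reconstruction loss is computed on the masked patches only.
        return F.mse_loss(pred, gather(patches, perm[:, n_keep:]))
```

After pretraining, the decoder is discarded and the encoder's latent representations feed downstream detection and classification heads; that division of labor is why MAE-style decoders can be much smaller than their encoders.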
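The abstract also mentions combining supervised and unsupervised contrastive objectives for species-aware embeddings. A hedged sketch of one such combination follows: a SimCLR-style term matching two augmented views of the same clip, plus a SupCon-style term treating all same-species clips as positives. The function name, the weighting `alpha`, and the temperature are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def combined_contrastive_loss(z1, z2, labels, temperature=0.1, alpha=0.5):
    """z1, z2: (B, D) embeddings of two augmented views; labels: (B,) species IDs."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature  # (B, B) cross-view similarities
    # Unsupervised term: each clip's two views are the only positive pair.
    targets = torch.arange(z1.size(0), device=z1.device)
    unsup = F.cross_entropy(logits, targets)
    # Supervised term: every same-species pair counts as a positive.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    sup = -(pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    return alpha * unsup + (1 - alpha) * sup.mean()
```

Blending the two terms lets the embedding space separate species (from the labeled term) while still exploiting unlabeled recordings (from the view-matching term), which is consistent with the label-efficiency claim in the abstract.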