Abstract: Masked Autoencoders (MAEs) learn rich representations in audio classification through an efficient self-supervised reconstruction task. Yet, general-purpose models struggle in fine-grained audio domains such as bird sound classification, which demands distinguishing subtle inter-species differences under high intra-species variability. We show that bridging this domain gap requires full-pipeline adaptation beyond domain-specific pretraining data. Using BirdSet, a large-scale bioacoustic benchmark, we systematically adapt pretraining, fine-tuning, and frozen feature utilization. Our Bird-MAE sets new state-of-the-art results on BirdSet’s multi-label classification benchmark. Additionally, we introduce parameter-efficient prototypical probing, which boosts the utility of frozen MAE features, yielding gains of up to 37 mAP points over linear probes and narrowing the gap to fine-tuning in low-resource settings. Bird-MAE also exhibits strong few-shot generalization with prototypical probes on our newly established few-shot benchmark on BirdSet, underscoring the importance of tailored self-supervised learning pipelines for fine-grained audio domains.
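As a rough illustration of the probing idea described above, the sketch below shows one plausible form of a prototypical probe over frozen MAE features: each class is represented by a learnable prototype vector, and a class score is the cosine similarity between that prototype and its best-matching frozen patch token. All names, shapes, and the max-pooling choice here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def prototypical_probe(tokens: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """Sketch of a prototypical probe on frozen features (assumed design).

    tokens:     (num_patches, dim) frozen patch features from the MAE encoder.
    prototypes: (num_classes, dim) learnable class prototypes -- in this
                sketch, the only trainable parameters of the probe.
    Returns (num_classes,) logits: cosine similarity of each prototype to
    its closest patch token (max-pooled over patches).
    """
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = p @ t.T                  # (num_classes, num_patches) cosine similarities
    return sim.max(axis=1)         # pool: strongest-matching patch per class

# Hypothetical usage with random stand-ins for real features.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 768))      # e.g. ViT patch tokens for one clip
prototypes = rng.normal(size=(10, 768))  # 10 hypothetical bird classes
logits = prototypical_probe(tokens, prototypes)
```

Because only the prototypes are trained while the backbone stays frozen, such a probe remains parameter-efficient, consistent with the abstract's framing.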
Submission Length: Regular submission (no more than 12 pages of main content)
Code: https://github.com/DBD-research-group/Bird-MAE
Assigned Action Editor: ~Chuan-Sheng_Foo1
Submission Number: 5030