Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification

ICLR 2026 Conference Submission 20520 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: audio self-supervised learning, probing, frozen embeddings, bioacoustics
TL;DR: This paper investigates the poor performance of probing in multi-label audio, attributing it to a pooling bottleneck rather than deficient features.
Abstract: Although probing frozen models has become a standard evaluation paradigm, self-supervised learning in audio still defaults to fine-tuning. A key reason is that global pooling creates an information bottleneck that causes linear probes to misrepresent embedding quality: the $\texttt{cls}$-token discards crucial patch-token information about dispersed, localized events in multi-label audio. This weakness stems from the mismatch between the pretraining objective, which operates globally, and the downstream task, which requires localized events. Across a comprehensive benchmark of 13 datasets and 6 spectrogram-based encoders, we first investigate the global pooling bottleneck. We then introduce binarized prototypical probes: a lightweight, simple pooling method that learns prototypes to perform class-wise information aggregation. Despite its simplicity, our method notably outperforms linear and attentive probing. Our work establishes probing as a competitive and efficient paradigm for evaluating audio SSL models, challenging the reliance on costly fine-tuning.
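To make the pooling idea concrete, below is a minimal illustrative sketch of a prototype-based probe over frozen patch tokens, written in PyTorch. It assumes each class owns a small set of learnable prototypes and scores a class by the best match between any of its prototypes and any patch token; the class name, the number of prototypes per class, and the use of plain max-pooling are assumptions for illustration, not the paper's exact (binarized) formulation.

```python
import torch
import torch.nn as nn


class PrototypicalProbe(nn.Module):
    """Sketch of a prototype-based pooling probe over frozen patch tokens.

    Each class owns `prototypes_per_class` learnable prototypes; the class
    logit is the maximum similarity between any of its prototypes and any
    patch token. This illustrates class-wise aggregation without a cls-token
    bottleneck; it is not the paper's exact binarized formulation.
    """

    def __init__(self, embed_dim: int, num_classes: int, prototypes_per_class: int = 4):
        super().__init__()
        self.num_classes = num_classes
        self.prototypes_per_class = prototypes_per_class
        # (num_classes * prototypes_per_class, embed_dim) learnable prototypes
        self.prototypes = nn.Parameter(
            torch.randn(num_classes * prototypes_per_class, embed_dim) * 0.02
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, embed_dim) frozen patch-token embeddings
        sim = tokens @ self.prototypes.t()            # (B, T, C*P) token-prototype similarities
        sim = sim.max(dim=1).values                   # best token per prototype: (B, C*P)
        sim = sim.view(-1, self.num_classes, self.prototypes_per_class)
        logits = sim.max(dim=2).values                # best prototype per class: (B, C)
        return logits


if __name__ == "__main__":
    probe = PrototypicalProbe(embed_dim=768, num_classes=527)
    patch_tokens = torch.randn(2, 512, 768)   # e.g. output of a frozen spectrogram encoder
    logits = probe(patch_tokens)               # multi-label logits, pair with BCEWithLogitsLoss
    print(logits.shape)                         # torch.Size([2, 527])
```

Only the prototype matrix is trained; the encoder stays frozen, so the probe remains comparable in cost to linear or attentive probing.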
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 20520