Keywords: audio representation learning, layer selection, effective rank, activation geometry, parameter-efficient inference, sparse autoencoders, mechanistic interpretability, pretrained audio models
Abstract: Frozen audio encoders are usually reused by taking a final embedding, but useful task evidence often lives in an earlier, lower-cost, or sparser part of the representation hierarchy. We study how useful readout depth varies across pretraining families, using layer probes together with activation geometry, sparse dictionary features, transcoder routes, and intervention tests. Across audio-text, ASR-supervised, masked, denoising, contrastive, speaker-aware, and self-distillation encoders, final-layer extraction loses at least 10 score points in about half of the evaluated encoder-task settings, with the largest gaps reaching 24--38 points. A zero-label selector based on isotropy and effective rank reduces low-resource ASR character error rate in 11 of 12 language-encoder settings, while few-shot probes recover 11--13 points over final-layer extraction on common-depth encoders. Sparse autoencoders and transcoders show when the selected readout is concentrated, distributed, stable, or editable. The result is a conservative, readout-centered view of audio representation reuse: pretraining family is associated with useful depth, effective-rank geometry provides a candidate-layer prior to validate, and sparse/routing analyses record how selected layers store and route task evidence.
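To make the effective-rank part of the zero-label selector concrete, here is a minimal sketch, assuming layer activations are available as NumPy arrays; it is not the paper's exact selector (which also incorporates an isotropy term), and the function and variable names are illustrative assumptions.

```python
# Illustrative sketch: score each layer of a frozen encoder by the effective
# rank of its activation matrix and pick the top-scoring layer as a
# zero-label readout candidate. Names here are assumptions, not the paper's API.
import numpy as np

def effective_rank(activations: np.ndarray) -> float:
    """Effective rank (entropy of the normalized singular-value spectrum)
    of an (n_frames, d) activation matrix."""
    # Center features so the singular spectrum reflects variance structure.
    x = activations - activations.mean(axis=0, keepdims=True)
    s = np.linalg.svd(x, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

def select_layer(layer_activations: list[np.ndarray]) -> int:
    """Return the index of the candidate layer with the largest effective rank."""
    scores = [effective_rank(a) for a in layer_activations]
    return int(np.argmax(scores))
```

As described in the abstract, such a score would serve only as a candidate-layer prior to be validated, not as a replacement for task-specific evaluation.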
Submission Number: 137