Abstract: Text-prompt probes of music foundation models
often show high linear separability for semantic attributes, but it
is unclear whether that signal reflects acoustic understanding or
lexical regularities. We investigate one guiding question throughout
this paper: when a MusicCoCa embedding, used in Magenta’s
Realtime generative music model, separates an attribute from text
prompts, does that separation persist for real audio under artist-
disjoint evaluation? To answer this, we compare matched text and
audio probing conditions using 500 MagnaTagATune clips, six
feasible tag families, and 3-fold group cross-validation by artist
identity. We report both raw accuracy and margin over chance
to normalize for differing class counts. The main result is a consistent
text-audio gap for most semantic attributes, especially instrument
and timbre, while loudness remains strongly separable in both
modalities. This pattern suggests that text-side separability can
overstate acoustic grounding if interpreted without modality-
matched controls. We contribute a reproducible protocol for side-
by-side text versus audio probing and a transparent analysis of
what the probes do and do not justify.
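
As a rough illustration of the evaluation protocol summarized above, the following Python sketch shows one way to build artist-disjoint folds and to compute a margin over chance. It is not the paper's implementation: the function names `artist_disjoint_folds` and `margin_over_chance` are hypothetical, the greedy fold assignment is a stand-in for whatever grouping procedure was actually used, and the margin is assumed here to be accuracy minus the uniform-chance rate 1/K for a K-class tag family.

```python
from collections import defaultdict


def margin_over_chance(accuracy, n_classes):
    """Accuracy minus the uniform-chance baseline 1/n_classes.

    Assumed definition; the abstract does not spell out the formula.
    """
    if n_classes < 2:
        raise ValueError("need at least two classes")
    return accuracy - 1.0 / n_classes


def artist_disjoint_folds(artists, n_folds=3):
    """Yield (train_idx, test_idx) pairs so that no artist is on both sides.

    `artists` holds one artist label per clip. Artists are assigned
    greedily (largest first) to the currently smallest fold.
    """
    by_artist = defaultdict(list)
    for idx, artist in enumerate(artists):
        by_artist[artist].append(idx)
    if len(by_artist) < n_folds:
        raise ValueError("fewer distinct artists than folds")

    folds = [[] for _ in range(n_folds)]
    # Sort by descending clip count, then by name, for a deterministic split.
    for artist in sorted(by_artist, key=lambda a: (-len(by_artist[a]), a)):
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(by_artist[artist])

    for k in range(n_folds):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j in range(n_folds) if j != k for i in folds[j])
        yield train_idx, test_idx
```

A probe would then be fit on the training indices of each fold and scored on the held-out indices, once with text embeddings and once with audio embeddings, so that the two modalities are compared on identical splits.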