Probing Audio Understanding in Realtime Generative Music Models

Arjun Bahuguna

Published: 19 Mar 2026, Last Modified: 11 May 2026 | OpenReview Archive Direct Upload | Everyone | CC BY-NC 4.0

Abstract: Text-prompt probes of music foundation models often show high linear separability for semantic attributes, but it is unclear whether that signal reflects acoustic understanding or lexical regularities. We investigate one guiding question throughout this paper: when a MusicCoCa embedding, as used in Magenta’s Realtime generative music model, separates an attribute from text prompts, does that separation persist for real audio under artist-disjoint evaluation? To answer this, we compare matched text and audio probing conditions using 500 MagnaTagATune clips, six feasible tag families, and 3-fold group cross-validation by artist identity. We report both raw accuracy and margin over chance, the latter to normalize for differences in class structure across tag families. The main result is a consistent text-audio gap for most semantic attributes, especially instrument and timbre, while loudness remains strongly separable in both modalities. This pattern suggests that text-side separability can overstate acoustic grounding if interpreted without modality-matched controls. We contribute a reproducible protocol for side-by-side text versus audio probing and a transparent analysis of what the probes do and do not justify.
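The two evaluation ingredients named in the abstract, artist-disjoint folds and margin over chance, can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `group_folds` and `margin_over_chance`, the greedy fold balancing, the toy artist labels, and the definition of chance as 1 / (number of classes) are all assumptions made for illustration. A majority-class baseline would be an equally plausible reading of "chance".

```python
from collections import defaultdict


def group_folds(groups, n_folds=3):
    """Split sample indices into folds so that no group (artist) spans two folds.

    Hypothetical helper: groups are assigned greedily, largest first, to the
    fold that currently holds the fewest samples.
    """
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)

    folds = [[] for _ in range(n_folds)]
    ordered = sorted(by_group, key=lambda g: (-len(by_group[g]), str(g)))
    for group in ordered:
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_group[group])
    return folds


def margin_over_chance(accuracy, n_classes):
    """Accuracy minus uniform chance (assumed here to be 1 / n_classes)."""
    return accuracy - 1.0 / n_classes


# Toy artist labels, one per clip (hypothetical, not MagnaTagATune data).
artists = ["a", "a", "b", "c", "c", "c", "d", "e", "e"]
folds = group_folds(artists, n_folds=3)

for test_idx in folds:
    held_out = set(test_idx)
    test_artists = {artists[i] for i in held_out}
    train_artists = {artists[i] for i in range(len(artists)) if i not in held_out}
    # No artist appears on both sides of the split.
    assert test_artists.isdisjoint(train_artists)
```

Under this definition, the same raw accuracy yields a smaller margin for a tag family with fewer classes, which is what makes margins comparable across families while raw accuracies are not.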