Keywords: Audio Large Language Models, Multimodal Large Language Models, Music Understanding, Benchmarking and Evaluation, Schema-Guided Reasoning, LogicLM
TL;DR: We test MLLMs on core music tasks and find a strong model×modality effect: performance on MIDI is near ceiling, but audio perception remains limited.
Abstract: Multimodal Large Language Models (MLLMs) claim “musical understanding,” yet most evaluations conflate listening with score reading. We benchmark three SOTA LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni) across three core music skills: Syncopation Scoring (rhythm perception), Transposition Detection (melody perception), and Chord Quality Identification (harmony perception). Moreover, we separate three sources of variability: (i) perceptual limitations (by contrasting audio recordings vs. symbolic MIDI inputs), (ii) exposure to prior examples (zero- vs. few-shot manipulations), and (iii) reasoning strategies (Standalone, Chain of Thought, LogicLM). For the latter, we adapt LogicLM, a framework combining LLMs with symbolic solvers to perform structured reasoning. In LogicLM, LLMs act as perceptual formulators, generating strict, machine-checkable schemas (onset grids, interval sequences) that deterministic solvers execute with self-refinement. Our results reveal a clear perceptual gap: models perform near ceiling on MIDI but show substantial accuracy drops on audio. Reasoning and few-shot prompting offer minimal gains. This is expected for MIDI, where performance reaches saturation, but more surprising for audio, where LogicLM, despite near-perfect MIDI accuracy, remains notably brittle. Among models, Gemini 2.5 Pro achieves the highest performance across most conditions. Transposition Detection yields the highest accuracies across models, while Chord Quality Identification scores slightly below Syncopation Scoring. Overall, current systems reason well over symbols (MIDI) but do not yet “listen” reliably from audio, and reasoning strategies have little impact on accuracy. Our method and dataset make the perception–reasoning boundary explicit and offer actionable guidance for building robust audio music systems.
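The split the abstract describes for LogicLM, where the model emits a strict schema and a deterministic solver does the rest, can be illustrated for Transposition Detection. The following is a minimal sketch, not the authors' implementation: the schema (a flat list of MIDI pitch numbers), the function names `validate_schema`, `intervals`, and `is_transposition`, and the rule "same interval sequence means transposition" are all assumptions made for illustration.

```python
from typing import List, Sequence


def validate_schema(pitches: object) -> List[str]:
    """Machine-check a hypothetical schema: a non-empty list of MIDI pitches.

    Returns a list of error messages; an empty list means the schema is valid.
    In a self-refinement loop, these messages could be fed back to the model.
    """
    if not isinstance(pitches, list) or not pitches:
        return ["expected a non-empty list of MIDI pitch numbers"]
    errors = []
    for i, p in enumerate(pitches):
        if not isinstance(p, int) or not 0 <= p <= 127:
            errors.append(f"item {i}: {p!r} is not an integer in 0..127")
    return errors


def intervals(pitches: Sequence[int]) -> List[int]:
    """Successive semitone differences between consecutive pitches."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def is_transposition(melody_a: Sequence[int], melody_b: Sequence[int]) -> bool:
    """Assumed rule: two melodies are transpositions of each other when they
    have the same length and identical interval sequences."""
    if len(melody_a) != len(melody_b):
        return False
    return intervals(melody_a) == intervals(melody_b)


if __name__ == "__main__":
    original = [60, 62, 64, 65]   # C4 D4 E4 F4
    shifted = [67, 69, 71, 72]    # same contour, seven semitones higher
    altered = [60, 63, 64, 65]    # second note changed
    print(validate_schema(original))
    print(is_transposition(original, shifted))
    print(is_transposition(original, altered))
```

In this sketch the solver is trivially correct once the pitch list is right, so any error must come from the step that produces the list. That is consistent with the abstract's finding that accuracy is near ceiling on MIDI input and drops on audio.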
Supplementary Material: pdf
Cameraready Material: zip
Submission Number: 35