Towards Multimodal Understanding of Music Scores and Performance Audio

ICLR 2026 Conference Submission 17059 Authors

19 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: Multimodal Large Language Models; Music Understanding; Benchmarking; Retrieval-Augmented Generation; Optical Music Recognition
Abstract: Music theory, scores, and performance audio are central modalities in music research, carrying rich information about melody, harmony, rhythm, and expressive interpretation. Yet, current multimodal large language models (MLLMs) struggle to reason jointly over symbolic and acoustic inputs, particularly when dealing with high-resolution scores and fine-grained performance signals. We introduce MuseBench, the first benchmark designed to evaluate MLLMs across three key dimensions of music understanding: (1) fundamental theory knowledge, (2) score-based reasoning, and (3) performance-level interpretation. To address these challenges, we further present MuseAgent, a multimodal retrieval-augmented large language model framework. MuseAgent employs two specialized perceptual modules: measure-wise optical music recognition (M-OMR) for sheet images and automatic music transcription (AMT) for performance audio. These modules unify heterogeneous modalities into structured textual representations (e.g., ABC notation, MusicXML, JSON), which can then be directly consumed by an LLM. A database retrieval module enables both explicit retrieval (user-driven) and implicit retrieval (agent-triggered) from symbolic and audio libraries, while also serving as a storage layer for structured music. Combined with a lightweight memory bank, MuseAgent supports multi-turn, interactive orchestration of modules according to user intent. Extensive evaluations on MuseBench show that MuseAgent outperforms general-purpose MLLMs in symbolic and performance-level reasoning, demonstrating the effectiveness of combining structured multimodal representations, retrieval/storage, and agent-based orchestration.
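The abstract describes an agent that converts sheet images and performance audio into structured text, retrieves from a symbolic/audio database, and keeps a lightweight memory across turns. The following is a minimal sketch of how such an orchestration loop could look; it is not the authors' implementation, and every name here (run_momr, run_amt, MusicDatabase, MemoryBank, answer) is a hypothetical stand-in for the modules named in the abstract.

```python
# Hypothetical sketch of a MuseAgent-style orchestration loop.
# All module and function names below are illustrative stand-ins,
# not the submission's actual API.

from dataclasses import dataclass, field
from typing import Callable, Dict, List


def run_momr(score_image_path: str) -> str:
    """Measure-wise OMR stub: a real module would return ABC/MusicXML per measure."""
    return "X:1\nT:Placeholder\nM:4/4\nK:C\nC D E F | G A B c |"


def run_amt(audio_path: str) -> Dict:
    """AMT stub: a real module would return note events (onset, offset, velocity)."""
    return {"notes": [{"pitch": 60, "onset": 0.0, "offset": 0.5, "velocity": 72}]}


@dataclass
class MusicDatabase:
    """Stores structured music; supports explicit (user-driven) and
    implicit (agent-triggered) retrieval via a simple keyword match."""
    entries: Dict[str, str] = field(default_factory=dict)

    def store(self, key: str, structured: str) -> None:
        self.entries[key] = structured

    def retrieve(self, query: str) -> List[str]:
        return [v for k, v in self.entries.items() if query.lower() in k.lower()]


@dataclass
class MemoryBank:
    """Lightweight multi-turn memory: keeps prior (user, agent) exchanges."""
    turns: List[str] = field(default_factory=list)

    def add(self, role: str, text: str) -> None:
        self.turns.append(f"{role}: {text}")


def answer(user_query: str, score_path: str, audio_path: str,
           db: MusicDatabase, memory: MemoryBank,
           llm: Callable[[str], str]) -> str:
    """Unify score + audio into structured text, retrieve context, query the LLM."""
    score_text = run_momr(score_path)        # sheet image -> ABC notation
    perf_json = run_amt(audio_path)          # performance audio -> note events
    db.store(score_path, score_text)         # database doubles as a storage layer
    context = db.retrieve(score_path)        # implicit retrieval triggered by the agent
    prompt = "\n".join(memory.turns + [
        f"Score (ABC):\n{score_text}",
        f"Performance (JSON): {perf_json}",
        f"Retrieved context: {context}",
        f"User: {user_query}",
    ])
    reply = llm(prompt)
    memory.add("user", user_query)
    memory.add("agent", reply)
    return reply


if __name__ == "__main__":
    echo_llm = lambda p: f"[LLM would reason over:\n{p[:120]}...]"
    print(answer("Does the performance follow the notated dynamics?",
                 "score.png", "performance.wav",
                 MusicDatabase(), MemoryBank(), echo_llm))
```

The sketch only illustrates the data flow implied by the abstract: perceptual modules reduce both modalities to text (ABC/JSON), the database serves both retrieval and storage, and the memory bank carries prior turns into the next prompt.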
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 17059