Keywords: robustness, benchmark, multi-version dataset, music information retrieval, music analysis, automatic music transcription, local key estimation, multi-pitch estimation
TL;DR: We introduce an openly available, multi-version dataset explicitly designed to study robustness in music analysis and transcription.
Abstract: Robustness is a fundamental challenge for deep learning, as models frequently inherit dataset biases and fail to generalize across real-world variability. Models for music audio analysis and transcription—machine-learning tasks of particular difficulty and data scarcity—often lack robustness to changes in instrumentation, interpretation, or recording conditions. In contrast to text and vision, robustness in music remains underexplored. To address this gap, we introduce RUBATO, a manually curated, fully open music dataset and benchmark. Our central idea is to exploit the unique opportunities of Western classical music, where famous works are free of copyright and available in abundant recordings that follow the same score but differ in interpretation and recording conditions, supplemented by arrangements and adaptations for other instrumentations. For RUBATO, we collected and recorded 14 canonical works in up to 54 versions, totaling 560 audio tracks and 42 hours of audio, including original recordings, arrangements and adaptations, controlled piano renditions, and synthesized versions. We further curated symbolic scores and expert annotations for various tasks. Ensuring structural coherence for the majority of versions, we transfer annotations between versions using state-of-the-art alignment techniques, which we evaluate for the heterogeneous version pairs in RUBATO. The resulting high-quality annotations allow for benchmarking music understanding models, which we demonstrate for two selected tasks—automatic music transcription and local key estimation. Going beyond standard metrics, the multi-version design of RUBATO enables systematic evaluation not only of models' efficacy but also of their consistency across versions of the same work. We formalize this notion as cross-version consistency, which allows us to assess model robustness along various dimensions of music data.
Testing current machine-learning systems with different variants of such consistency measures, we find that most of these systems struggle to generalize under real-world variability, highlighting the need for more robust models and for benchmarks such as RUBATO that are capable of measuring such robustness.
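To make the notion of cross-version consistency concrete, a minimal sketch is given below. Note that this is an illustrative assumption, not the paper's exact formalization: it scores one work by averaging the frame-wise agreement of a model's predictions over all pairs of versions, assuming the predictions have already been aligned to a common timeline (e.g., via score-to-audio alignment). All function names here are hypothetical.

```python
# Hypothetical sketch of a cross-version consistency measure (illustrative,
# not the paper's exact definition): for one work, average the agreement of
# model predictions over all pairs of versions.
from itertools import combinations


def pairwise_agreement(pred_a, pred_b):
    """Fraction of time-aligned frames on which two versions' predictions agree."""
    assert len(pred_a) == len(pred_b), "predictions must share a common timeline"
    return sum(a == b for a, b in zip(pred_a, pred_b)) / len(pred_a)


def cross_version_consistency(predictions_by_version):
    """Mean pairwise agreement over all version pairs of one work.

    `predictions_by_version` maps a version id to a frame-wise label sequence,
    assumed already aligned to a common timeline (e.g., via score alignment).
    """
    pairs = list(combinations(predictions_by_version.values(), 2))
    return sum(pairwise_agreement(a, b) for a, b in pairs) / len(pairs)


# Toy example: frame-wise local-key labels for three versions of one work.
preds = {
    "piano":     ["C", "C", "G", "G"],
    "orchestra": ["C", "C", "G", "C"],
    "synth":     ["C", "C", "G", "G"],
}
print(round(cross_version_consistency(preds), 3))  # → 0.833
```

A score of 1.0 means the model predicts identical labels for every version of the work; lower values expose sensitivity to instrumentation, interpretation, or recording conditions independently of ground-truth accuracy.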
Primary Area: datasets and benchmarks
Submission Number: 11359