Abstract: Expressive dynamics in music performance are subjective and context-dependent, yet most symbolic models treat Dynamics Markings (DMs) as static, with fixed MIDI velocities. This paper proposes a method for predicting DMs in piano performance that combines MusicXML score information with performance MIDI data through a novel tokenization scheme and an adapted RoBERTa-based Masked Language Model (MLM). Our approach focuses on contextually aggregated MIDI velocities and their corresponding DMs, accounting for pianists' subjective interpretations. Note-level features are serialized and translated into a token sequence used to predict both constant DMs (e.g., mp, ff) and non-constant DMs (e.g., crescendo, fp). Evaluation on three expert performance datasets shows that the model effectively learns dynamics transitions from contextual note blocks and generalizes beyond constant markings. This is the first study to model both constant and non-constant dynamics in a unified framework using contextual sequence learning. The results suggest promising applications in expressive music analysis, performance modeling, and computer-assisted music education.
External IDs: doi:10.1109/lsp.2025.3633579