Learning Music Style For Piano Arrangement Through Cross-Modal Bootstrapping

ICLR 2026 Conference Submission 24894 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: music generation, audio-to-symbolic alignment, piano cover generation, style transfer, Q-Former
TL;DR: We present a cross-modal framework that enables style-faithful symbolic piano cover arrangement from music audio.
Abstract: What is music style? Though often described with text labels such as "swing," "classical," or "emotional," actual style remains implicit, hidden in concrete musical examples. In this paper, we introduce a cross-modal framework that learns implicit music styles from raw audio and applies them to symbolic music generation. Inspired by BLIP-2, our model leverages a Querying Transformer (Q-Former) to extract style representations from a large, pre-trained audio language model (LM) and uses them to condition a symbolic LM for generating piano arrangements. We adopt a two-stage training strategy: contrastive learning to align auditory style with symbolic expression, followed by generative modelling to perform music arrangement. Our model generates piano performances jointly conditioned on a lead sheet (content) and a reference audio example (style), enabling controllable and stylistically faithful arrangement. Experiments demonstrate the effectiveness of our approach on piano cover generation, style transfer, and audio-to-MIDI retrieval, with substantial improvements in style-aware alignment and music quality.
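To make the described pipeline concrete, the following is a minimal, hypothetical PyTorch sketch of the two components the abstract names: a BLIP-2-style Q-Former whose learned queries cross-attend to frozen audio-LM features to produce style tokens, and a symmetric InfoNCE loss standing in for the first-stage contrastive alignment. All module names, dimensions, and the specific contrastive objective are illustrative assumptions, not the submission's actual implementation.

```python
# Hypothetical sketch of the cross-modal pipeline; details are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StyleQFormer(nn.Module):
    """Learned queries cross-attend to frozen audio-LM features and return
    a fixed number of style tokens (a simplified, BLIP-2-style Q-Former)."""

    def __init__(self, d_model=512, n_queries=32, n_heads=8, n_layers=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, n_layers)

    def forward(self, audio_feats):  # audio_feats: (B, T_audio, d_model)
        b = audio_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries attend over the audio features; output: (B, n_queries, d)
        return self.decoder(q, audio_feats)


def contrastive_loss(style_tokens, symbolic_emb, temperature=0.07):
    """Stage 1: align pooled audio-style tokens with a pooled embedding of
    the matching symbolic piece (symmetric InfoNCE, assumed here)."""
    a = F.normalize(style_tokens.mean(dim=1), dim=-1)  # (B, d)
    s = F.normalize(symbolic_emb, dim=-1)              # (B, d)
    logits = a @ s.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2


# Stage 2 (generative): prepend the style tokens to the symbolic LM's input
# so generation is conditioned on both the lead sheet and the reference style.
if __name__ == "__main__":
    B, T_AUDIO, D = 2, 100, 512
    qformer = StyleQFormer(d_model=D)
    audio_feats = torch.randn(B, T_AUDIO, D)   # stand-in for frozen audio-LM output
    style = qformer(audio_feats)               # (B, 32, D) style tokens
    symbolic_emb = torch.randn(B, D)           # stand-in pooled symbolic encoding
    loss = contrastive_loss(style, symbolic_emb)
    lead_sheet_emb = torch.randn(B, 64, D)     # stand-in embedded lead-sheet tokens
    lm_input = torch.cat([style, lead_sheet_emb], dim=1)
    print(loss.item(), lm_input.shape)
```

Prepending the style tokens as a soft prefix to the symbolic LM is one plausible reading of "condition a symbolic LM"; cross-attention conditioning would be an equally consistent alternative.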
Primary Area: generative models
Submission Number: 24894