Keywords: music generation, audio-to-symbolic, piano cover generation, style transfer, Q-Former
TL;DR: We present a cross-modal framework that enables style-faithful symbolic piano cover arrangement from music audio.
Abstract: What is music style? Though often described with text labels such as "swing," "classical," or "emotional," the actual style remains implicit, hidden in concrete music examples. In this paper, we introduce a cross-modal framework that learns implicit music styles from raw audio and applies them to symbolic music generation. Inspired by BLIP-2, our model leverages a Querying Transformer (Q-Former) to extract style representations from a large, pre-trained audio language model (LM), and then uses them to condition a symbolic LM for generating piano arrangements. We adopt a two-stage training strategy: contrastive learning to align style representations with symbolic expression, followed by generative modeling to perform music arrangement. We name our model BOSSA (BOotStrapping audio-to-Symbolic Arrangement). It generates piano performances jointly conditioned on a lead sheet (content) and a reference audio example (style), enabling controllable and stylistically faithful arrangement.
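To make the abstract's pipeline concrete, here is a minimal NumPy sketch of the core idea: learnable queries cross-attend over frozen audio-LM features to produce style tokens, which would then be aligned contrastively (stage 1) and used as soft-prompt conditioning for the symbolic LM (stage 2). All dimensions, variable names, and the single-head attention are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: T audio frames, feature dim d, n_q learnable queries
T, d, n_q = 50, 32, 4

audio_feats = rng.normal(size=(T, d))  # stand-in for frozen audio-LM features
queries = rng.normal(size=(n_q, d))    # learnable Q-Former style queries

# Cross-attention: each query attends over all audio frames
attn = softmax(queries @ audio_feats.T / np.sqrt(d), axis=-1)  # (n_q, T)
style_tokens = attn @ audio_feats                              # (n_q, d)

# Stage 1 (sketch): contrastive alignment against a symbolic-side
# embedding of the same piece (stand-in vector); cosine similarity
# would feed an InfoNCE-style loss over a batch.
symbolic_emb = rng.normal(size=(d,))
style_vec = style_tokens.mean(axis=0)
score = style_vec @ symbolic_emb / (
    np.linalg.norm(style_vec) * np.linalg.norm(symbolic_emb)
)

# Stage 2 (not shown): style_tokens would be prepended as soft prompts
# to the symbolic LM that generates the piano arrangement.
print(style_tokens.shape)
```

The sketch omits multi-head attention, layer stacking, and gradient updates; it only illustrates how a small, fixed number of query tokens can distill a variable-length audio sequence into a compact style representation.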
Track: Paper Track
Confirmation: Paper Track: I confirm that I have followed the formatting guideline and anonymized my submission.
(Optional) Short Video Recording Link: https://youtu.be/yG-wmdz4Ntw?si=rHTIN9SDCHRMKbx9
Submission Number: 89