Controllable Music Loops Generation with MIDI and Text via Multi-Stage Cross Attention and Instrument-Aware Reinforcement Learning
Abstract: Text-to-music generation models have shown great promise in generating high-quality music aligned with users' textual descriptions. These models effectively capture abstract, global musical features such as style and mood. However, they often fail to render precisely the attributes that are essential for modern music loop production, including melody, rhythm, and instrumentation. To overcome this limitation, this paper proposes a Loops Transformer and a Multi-Stage Cross Attention mechanism that together enable a cohesive integration of textual and MIDI input specifications. Additionally, a novel Instrument-Aware Reinforcement Learning technique is introduced to ensure that the generated loops use the instrumentation specified in the text prompt. We demonstrate that the proposed model can generate music loops that simultaneously satisfy the conditions specified by both natural language text and MIDI input, ensuring coherence between the two modalities. We also show that our model outperforms the state-of-the-art baseline, MusicGen, on both objective metrics (lowering the FAD score by 1.3, where lower scores indicate higher quality, and improving the Normalized Dynamic Time Warping Distance to the given melodies by 12%) and subjective metrics (+2.56% in OVL, +5.42% in REL, and +7.74% in Loop Consistency). These improvements highlight our model's capability to produce musically coherent loops that satisfy the complex requirements of contemporary music production. Generated music loop samples can be explored at: https://loopstransformer.netlify.app/ .
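For readers unfamiliar with the melody metric named in the abstract, the following is a minimal sketch of a normalized dynamic time warping (DTW) distance between two pitch sequences. It is not the paper's implementation: the function names, the absolute-difference local cost, the use of MIDI note numbers as input, and the normalization by the summed sequence lengths are all assumptions made for illustration, and the abstract does not specify which variant was used.

```python
import math


def dtw_distance(a, b):
    """Classic DTW cost between two numeric sequences (assumed: MIDI pitches)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: absolute pitch difference (an illustrative choice)
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


def normalized_dtw(a, b):
    """DTW cost divided by len(a) + len(b) (one common normalization, assumed here)."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    return dtw_distance(a, b) / (len(a) + len(b))


if __name__ == "__main__":
    reference = [60, 62, 64, 65]      # hypothetical conditioning melody
    generated = [60, 60, 62, 64, 65]  # same contour, first note held longer
    print(normalized_dtw(reference, generated))
```

Under this sketch, a generated melody that follows the reference contour with different note durations scores close to zero, while pitch deviations raise the distance, so lower values indicate closer adherence to the given melody.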
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Generation] Generative Multimedia, [Generation] Multimedia Foundation Models, [Content] Multimodal Fusion
Relevance To Conference: The Loops Transformer model introduces a novel approach to generating high-quality music loops that cohesively integrate text and MIDI inputs. By employing a Multi-Stage Cross Attention mechanism and an improved Codebook Interleaving Pattern with Loop Shift data augmentation, the model ensures seamless and musically consistent music loop generation. Additionally, the proposed Instrument-Aware Reinforcement Learning strategy enhances the model's ability to generate loops with instruments matching the textual prompts. Extensive experiments demonstrate that the Loops Transformer significantly outperforms state-of-the-art baselines in generating relevant and consistent music loops, as validated by both objective and subjective metrics. Furthermore, comprehensive ablation studies analyze the impact of each key component, highlighting their individual contributions to the model's overall performance. This work pushes the boundaries of multimedia and multimodal processing by enabling the generation of sophisticated music loops that align with user-specified text and MIDI conditions, opening up new possibilities for personalized and interactive music production.
Submission Number: 2946