MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners

Fang-Duo Tsai; Shih-Lun Wu; Weijaw Lee; Sheng-Ping Yang; Bo-Rui Chen; Hao-Chung Cheng; Yi-Hsuan Yang

Published: 01 May 2025, Last Modified: 23 Jul 2025

Venue: ICML 2025 poster

License: CC BY-NC-SA 4.0

TL;DR: A framework that employs rotary positional embeddings in decoupled cross-attention layers to achieve precise controllability for musical attribute conditioning, as well as audio inpainting and outpainting.

Abstract: We propose MuseControlLite, a lightweight mechanism designed to fine-tune text-to-music generation models for precise conditioning using various time-varying musical attributes and reference audio signals. The key finding is that positional embeddings, which have been seldom used by text-to-music generation models in the conditioner for text conditions, are critical when the condition of interest is a function of time. Using melody control as an example, our experiments show that simply adding rotary positional embeddings to the decoupled cross-attention layers increases control accuracy from 56.6% to 61.1%, while requiring 6.75 times fewer trainable parameters than state-of-the-art fine-tuning mechanisms, using the same pre-trained diffusion Transformer model of Stable Audio Open. We evaluate various forms of musical attribute control, audio inpainting, and audio outpainting, demonstrating improved controllability over MusicGen-Large and Stable Audio Open ControlNet at a significantly lower fine-tuning cost, with only 85M trainable parameters. Source code, model checkpoints, and demo examples are available at: https://MuseControlLite.github.io/web/
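
Illustrative sketch: the abstract's central idea, applying rotary positional embeddings inside a decoupled cross-attention branch, can be pictured roughly as follows. This is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the function names (`rope`, `attend`, `decoupled_cross_attention`), the `cond_scale` parameter, the single-head layout, and the additive combination of the two branches are all hypothetical choices made for illustration. It assumes the query sequence and the time-varying condition sequence share the same frame index, so position `t` means the same moment in both.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by position-dependent angles."""
    d = len(x)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        w = softmax([dot(q, k) * scale for k in keys])
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def decoupled_cross_attention(q, k_text, v_text, k_cond, v_cond, cond_scale=1.0):
    """Text branch without positions plus a condition branch with rotary positions.

    Assumption: the two branches are summed, with cond_scale weighting the
    condition branch. Only the condition branch sees positional information.
    """
    text_out = attend(q, k_text, v_text)
    q_rot = [rope(x, t) for t, x in enumerate(q)]
    k_rot = [rope(x, t) for t, x in enumerate(k_cond)]
    cond_out = attend(q_rot, k_rot, v_cond)
    return [[a + cond_scale * b for a, b in zip(row_t, row_c)]
            for row_t, row_c in zip(text_out, cond_out)]
```

The property that makes rotary embeddings a natural fit for time-aligned conditions is that the score between a rotated query at position m and a rotated key at position n depends only on the offset m - n, so the condition branch can learn to attend to the condition frame that lines up with each output frame. Setting `cond_scale` to 0 in this sketch reduces it to plain text cross-attention.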

Lay Summary: MuseControlLite is a fully open-source, controllable text-to-music model designed for low-cost training. It supports precise control over melody, rhythm, and dynamics, as well as audio inpainting and outpainting, and allows flexible combinations of these conditions. MuseControlLite achieves state-of-the-art performance in melody-conditioned music generation tasks.

Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.

Link To Code: https://github.com/fundwotsai2001/MuseControlLite

Primary Area: Deep Learning->Generative Models and Autoencoders

Keywords: Sound, Artificial Intelligence, Machine Learning, Audio and Speech Processing

Submission Number: 15022
