Abstract. A text-to-music (TTM) model should synthesize audio that
reflects the concepts in a given prompt as long as it has been trained
on those concepts. If a prompt references concepts that the TTM model
has not been trained on then the audio it synthesizes will likely not
match. This paper investigates the application of a simple gradient-based
approach called textual inversion (TI) to expand the concept vocabulary
of a trained TTM model without compromising the fidelity of concepts on
which it has already been trained. We apply this technique to MusicGen
and measure its reconstruction and editability quality, as well as its
subjective quality. We find that TI can expand the concept vocabulary of a
pretrained TTM model, making it more personalized and controllable
without having to finetune the entire model.
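The core idea of textual inversion can be illustrated with a toy sketch: a single new token embedding is optimized by gradient descent while every weight of the pretrained model stays frozen. The names and the tiny stand-in "decoder" below are illustrative assumptions, not MusicGen's actual API.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for a frozen pretrained model: maps a token embedding to
# "audio features" through a fixed linear layer (hypothetical, for
# illustration only -- not the real TTM architecture).
emb_dim, audio_dim = 8, 4
frozen_decoder = torch.nn.Linear(emb_dim, audio_dim)
for p in frozen_decoder.parameters():
    p.requires_grad_(False)  # model weights are never updated

# Target "audio features" for the new concept we want to invert.
target = torch.randn(audio_dim)

# Textual inversion: learn ONE new token embedding by gradient descent;
# this is the only trainable parameter.
new_token_emb = torch.nn.Parameter(torch.zeros(emb_dim))
opt = torch.optim.Adam([new_token_emb], lr=0.1)

for _ in range(500):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(frozen_decoder(new_token_emb), target)
    loss.backward()
    opt.step()

print(f"final reconstruction loss: {loss.item():.4f}")
```

Because only the new embedding is trained, the model's behavior on concepts it already knows is untouched, which is why this approach does not compromise existing fidelity.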