Abstract. A text-to-music (TTM) model should synthesize audio that
reflects the concepts in a given prompt as long as it has been trained
on those concepts. If a prompt references concepts that the TTM model
has not been trained on then the audio it synthesizes will likely not
match. This paper investigates the application of a simple gradient-based
approach called textual inversion (TI) to expand the concept vocabulary
of a trained TTM model without compromising the fidelity of concepts on
which it has already been trained. We apply this technique to MusicGen
and measure its reconstruction and editability quality, as well as its
subjective quality. We find that TI can expand the concept vocabulary of a
pretrained TTM model, making it more personalized and controllable
without having to finetune the entire model.
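The core idea of textual inversion can be illustrated with a toy sketch: a single new token embedding is optimized by gradient descent while every weight of the pretrained model stays frozen. The names and the tiny stand-in "decoder" below are illustrative assumptions, not MusicGen's actual API.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for a frozen pretrained model: maps a token embedding to
# "audio features" through a fixed linear layer (hypothetical, for
# illustration only -- not the real TTM architecture).
emb_dim, audio_dim = 8, 4
frozen_decoder = torch.nn.Linear(emb_dim, audio_dim)
for p in frozen_decoder.parameters():
    p.requires_grad_(False)  # model weights are never updated

# Target "audio features" for the new concept we want to invert.
target = torch.randn(audio_dim)

# Textual inversion: learn ONE new token embedding by gradient descent;
# this is the only trainable parameter.
new_token_emb = torch.nn.Parameter(torch.zeros(emb_dim))
opt = torch.optim.Adam([new_token_emb], lr=0.1)

for _ in range(500):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(frozen_decoder(new_token_emb), target)
    loss.backward()
    opt.step()

print(f"final reconstruction loss: {loss.item():.4f}")
```

Because only the new embedding is trained, the model's behavior on concepts it already knows is untouched, which is why this approach does not compromise existing fidelity.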