Keywords: Inference and Prompt Engineering of Music LLMs, In-Context Learning of Music LLMs
TL;DR: This paper evaluates the extent to which expertise in prompt construction influences the quality of music generation and proposes a Retrieval-Augmented Prompt Rewrite (RAG) system that transforms novice prompts into expert-level descriptions using CLAP and GPT-3.5.
Abstract: This paper evaluates the extent to which expertise in prompt construction influences the quality of music generation output. We propose a Retrieval-Augmented Prompt Rewrite (RAG) system that transforms novice prompts into expert-level descriptions using CLAP. Our method helps preserve user intent while bypassing the need for extensive domain training on the user's part. Given novice-level prompts, participants selected relevant terminology from the top-k most textually or audibly similar MusicCaps captions, which was then fed into GPT-3.5 to create succinct, expert-level rewrites. These rewrites were then used to generate music with Stable Audio 2.0. To mitigate anchoring bias toward expert prompts, we implemented a counterbalanced design and conducted a subjective study to evaluate the effectiveness of RAG. As a baseline, we generated rewrites using a traditional LoRA fine-tuning method. Participants evaluated the expertness, musicality, production quality, and overall preference of music generated from novice and expert prompts. Both RAG and LoRA rewrites significantly improve music generation across all NLP and subjective metrics, with RAG outperforming LoRA overall. Finally, the subjective results largely align with Meta's Audiobox Aesthetics metrics.
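The retrieval-and-rewrite step described in the abstract could look roughly like the sketch below. This is an illustrative reconstruction, not the authors' implementation: it assumes the `laion_clap` and `openai` Python packages, a locally available list of MusicCaps captions (`musiccaps_captions`), and a placeholder rewrite instruction; model choices and hyperparameters (e.g. k = 5) are assumptions.

```python
# Sketch: retrieve the top-k MusicCaps captions most similar to a novice prompt
# using CLAP text embeddings, then ask GPT-3.5 to rewrite the prompt with
# terminology borrowed from those captions. Not the authors' code.
import numpy as np
import laion_clap
from openai import OpenAI


def top_k_captions(novice_prompt: str, captions: list[str], k: int = 5) -> list[str]:
    clap = laion_clap.CLAP_Module(enable_fusion=False)
    clap.load_ckpt()  # loads the default pretrained CLAP checkpoint
    embs = clap.get_text_embedding([novice_prompt] + captions)  # shape (1 + N, D)
    query, pool = embs[0], embs[1:]
    # Cosine similarity between the novice prompt and every caption in the pool.
    sims = pool @ query / (np.linalg.norm(pool, axis=1) * np.linalg.norm(query) + 1e-8)
    return [captions[i] for i in np.argsort(-sims)[:k]]


def rewrite_prompt(novice_prompt: str, retrieved: list[str]) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    context = "\n".join(f"- {c}" for c in retrieved)
    messages = [
        {"role": "system",
         "content": "Rewrite the user's music prompt as a succinct, expert-level "
                    "description, borrowing relevant terminology from the captions."},
        {"role": "user",
         "content": f"Novice prompt: {novice_prompt}\n\nSimilar captions:\n{context}"},
    ]
    resp = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    return resp.choices[0].message.content


# Usage (assuming musiccaps_captions is a list of caption strings):
# expert = rewrite_prompt("a chill beat for studying",
#                         top_k_captions("a chill beat for studying", musiccaps_captions))
```

The rewritten prompt would then be passed to the text-to-music model (Stable Audio 2.0 in the paper) in place of the original novice prompt.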
Submission Number: 1