Keywords: Zero-Shot Voice Cloning, Music Generation, Instruction Following, Audio Generation
Abstract: Existing text-to-music (TTM) models rarely address voice cloning for singing, and most mainstream approaches do not support cross-domain voice cloning based on reference speech. However, speech-to-music voice transfer is highly valuable in practical applications, as speech data is easier to collect and provides more stable speaker representations. In this paper, we propose S2M-Inject, a cross-domain music generation framework that enables voice cloning from reference speech. By injecting speaker representations extracted from speech into the music generation process, S2M-Inject produces music that preserves the voice characteristics of the input speech. Experimental results demonstrate that S2M-Inject effectively performs cross-domain voice cloning while maintaining reasonable music generation quality, and supports both Chinese and English music generation.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Multimodal Diffusion Transformer, Song Generation
Contribution Types: Model analysis & interpretability
Languages Studied: English, Chinese
Submission Number: 975