Keywords: Zero-Shot Voice Cloning, Music Generation, Instruction Following, Audio Generation
Abstract: Existing text-to-music (TTM) models rarely address voice cloning for singing, and most mainstream approaches do not support cross-domain voice cloning based on reference speech. However, speech-to-music voice transfer is highly valuable in practical applications, as speech data is easier to collect and provides more stable speaker representations. In this paper, we propose S2M-Inject, a cross-domain music generation framework that enables voice cloning from reference speech. By injecting speaker representations extracted from speech into the music generation process, S2M-Inject produces music that preserves the voice characteristics of the input speech. Experimental results demonstrate that S2M-Inject effectively performs cross-domain voice cloning while maintaining reasonable music generation quality, and supports both Chinese and English music generation.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Multimodal Diffusion Transformer, Song Generation
Contribution Types: Model analysis & interpretability
Languages Studied: English, Chinese
Submission Number: 975