A Unified Hybrid Speech-Sound Generation Framework for Zero-Shot Voice Cloning in Complex Acoustic Scenes

ACL ARR 2026 January Submission10872 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Zero-shot Voice Cloning, Compositional Audio Generation, Disentangled Representation
Abstract: Synthesizing complex auditory scenes—where customized voices seamlessly blend with dynamic sound effects—presents a formidable challenge. While recent unified models support diverse audio generation, they falter at synchronous generation with zero-shot timbre cloning due to the limited granularity of text descriptions. Directly bridging this gap creates a fundamental control dilemma: relying solely on abstract text suffers from representation ambiguity, whereas naively injecting acoustic references introduces multi-modal interference, frequently triggering an acoustic shortcut in which dominant vocal cues override semantic descriptions. To address these challenges, we propose ChameleAudio, the first unified framework capable of synchronous speech and sound generation while maintaining high-fidelity zero-shot voice cloning. To resolve the shortcut problem, we devise a progressive training strategy. This ordered paradigm prioritizes semantic controllability before refining acoustic details, ensuring the model captures high-level semantic descriptions. Furthermore, to explicitly resolve multi-condition conflicts, we incorporate a Disentangled Flow Matching strategy driven by Independent Condition Masking. By enforcing statistical independence among modalities during training, this mechanism prevents optimization collapse onto the dominant acoustic stream and enables precise multi-directional guidance during inference. Backed by our LLM-driven hybrid data pipeline, ChameleAudio achieves state-of-the-art performance in complex compositional audio generation.
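The Independent Condition Masking idea described above can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes two conditioning streams (a text description and an acoustic timbre reference), a learned null embedding stood in for by `None`, and a classifier-free-guidance-style combination of model outputs at inference. The function names (`mask_conditions`, `guided_velocity`) and guidance weights are hypothetical illustrations:

```python
import random

NULL = None  # placeholder for a learned "null" (unconditional) embedding


def mask_conditions(text_emb, ref_emb, p_drop=0.1, rng=random):
    """During training, drop each condition *independently* with probability
    p_drop, so the model sees all four text/reference combinations and cannot
    collapse onto the dominant acoustic stream."""
    t = NULL if rng.random() < p_drop else text_emb
    r = NULL if rng.random() < p_drop else ref_emb
    return t, r


def guided_velocity(model, x, t, text_emb, ref_emb, w_text=3.0, w_ref=1.5):
    """One possible multi-directional guidance rule at inference: separate
    guidance scales steer the text and acoustic-reference directions."""
    v_uncond = model(x, t, NULL, NULL)          # fully unconditional
    v_text = model(x, t, text_emb, NULL)        # text-only
    v_full = model(x, t, text_emb, ref_emb)     # both conditions
    return (v_uncond
            + w_text * (v_text - v_uncond)     # semantic direction
            + w_ref * (v_full - v_text))       # timbre direction
```

Because both conditions are dropped independently during training, the model learns valid velocity estimates for every conditioning subset, which is what makes the per-direction guidance terms above well-defined.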
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: Speech and Multimodality, Generation, Machine Learning for NLP, Resources and Evaluation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 10872