E2E-BPVC: End-to-End Background-Preserving Voice Conversion via In-Context Learning

Yihan Liu, Zhengyang Chen, Leying Zhang, Yanmin Qian

Published: 2025 · Last Modified: 12 Mar 2026 · INTERSPEECH 2025 · CC BY-SA 4.0
Abstract: Voice conversion (VC) systems are commonly trained on clean speech and fail to function properly in the presence of background sound. In many cases, however, the background sound is semantically tied to the context in which the speech occurs and should therefore be retained. Existing approaches address this by introducing a denoising module that separates speech from background sound before applying voice conversion, which increases system complexity and may introduce additional distortion. In this paper, we propose, for the first time, an end-to-end background-preserving voice conversion (E2E-BPVC) framework. By leveraging in-context learning (ICL), our model simultaneously modifies the speech timbre and retains background sounds without requiring a separate denoising step. Both objective and subjective evaluations demonstrate that our method achieves performance comparable to denoising-based BPVC frameworks while maintaining a more streamlined and efficient system design.