A Progressive Generation Framework with Speech Pre-trained Model for Expressive Voice Conversion

Published: 2025, Last Modified: 08 Jan 2026. Venue: ICME 2025. License: CC BY-SA 4.0
Abstract: Expressive voice conversion (EVC) aims to modify the speaker identity and emotional style of speech while preserving its content. Existing approaches often focus on disentangling speaker, emotion, and content information but overlook the progressive generation mechanisms in human speech production. To address this, we propose a three-stage framework that includes a speech disentanglement module, a progressive generator, and an acoustic refiner. This framework enables speech pre-trained models to parse linguistic content, emotional style, and speaker identity, which are then progressively integrated into the speech reconstruction branch to generate high-quality speech with replaceable emotional style and speaker identity. Experiments with six different pre-trained models show that our framework activates their disentanglement capabilities, surpasses baseline performance in EVC, and supports speaker and emotion control from different target samples. This framework also provides a valuable reference for evaluating the disentanglement capabilities of speech pre-trained models.
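The data flow described in the abstract can be illustrated with a toy sketch. Nothing here is the authors' implementation: the names `Disentangled`, `add_condition`, `progressive_generate`, `refine`, and `convert` are hypothetical, additive conditioning is an assumed stand-in for the real generator, the integration order (content, then emotion, then speaker) is an assumption, and the refiner is a placeholder. The sketch only shows how replaceable emotion and speaker representations could be taken from different target samples.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class Disentangled:
    """Hypothetical output of the disentanglement stage for one utterance."""
    content: List[Vector]  # one vector per frame (linguistic content)
    emotion: Vector        # utterance-level emotional style
    speaker: Vector        # utterance-level speaker identity


def add_condition(frames: List[Vector], cond: Vector) -> List[Vector]:
    """Assumed conditioning: add an utterance-level vector to every frame."""
    return [[x + c for x, c in zip(frame, cond)] for frame in frames]


def progressive_generate(content: List[Vector], emotion: Vector,
                         speaker: Vector) -> List[Vector]:
    """Integrate the factors one at a time (order is an assumption)."""
    hidden = content                         # step 1: linguistic content
    hidden = add_condition(hidden, emotion)  # step 2: emotional style
    hidden = add_condition(hidden, speaker)  # step 3: speaker identity
    return hidden


def refine(frames: List[Vector]) -> List[Vector]:
    """Placeholder acoustic refiner: returns a copy unchanged."""
    return [list(frame) for frame in frames]


def convert(source: Disentangled, emotion_ref: Disentangled,
            speaker_ref: Disentangled) -> List[Vector]:
    """Keep the source content; take emotion and speaker from other samples."""
    hidden = progressive_generate(source.content, emotion_ref.emotion,
                                  speaker_ref.speaker)
    return refine(hidden)


# Toy inputs: three utterances, each already split into its factors.
source = Disentangled(content=[[1.0, 2.0], [3.0, 4.0]],
                      emotion=[0.0, 0.0], speaker=[0.0, 0.0])
emotion_ref = Disentangled(content=[[0.0, 0.0]],
                           emotion=[0.5, 0.5], speaker=[9.0, 9.0])
speaker_ref = Disentangled(content=[[0.0, 0.0]],
                           emotion=[7.0, 7.0], speaker=[10.0, 20.0])

converted = convert(source, emotion_ref, speaker_ref)
```

In this sketch the emotion reference and the speaker reference are separate utterances, mirroring the abstract's claim that speaker and emotion can be controlled from different target samples; only the source's content frames reach the output.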