Keywords: voice conversion, accent conversion, emotion conversion, real-time, zero-shot
TL;DR: We propose StyleStream, the first real-time zero-shot voice style (timbre, accent, emotion) conversion system that achieves state-of-the-art conversion performance.
Abstract: Voice style conversion aims to transform an input utterance to match a target speaker’s timbre, accent, and emotion. A central challenge is disentangling linguistic content from style attributes. While prior work has investigated this disentanglement, conversion quality remains suboptimal. Moreover, no existing work addresses real-time voice style conversion. To address these limitations, we propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art conversion performance. StyleStream consists mainly of two components: a destylizer, which removes style attributes (timbre, accent, and emotion) while retaining linguistic content, and a stylizer, a diffusion transformer (DiT) that reintroduces style conditioned on the target speech. Content–style disentanglement is enforced in the destylizer through two mechanisms: (i) an automatic speech recognition (ASR) loss that provides text-level supervision, and (ii) a finite scalar quantization (FSQ) module with a compact codebook of size 45, which serves as a strong information bottleneck. The continuous representations preceding the FSQ layer are treated as the content features. By combining chunked-causal attention masking with a non-autoregressive architecture, StyleStream enables real-time voice style conversion with an end-to-end latency of 1 second.
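The FSQ bottleneck described above can be sketched as follows. FSQ bounds each latent dimension and rounds it to a small fixed grid; the codebook size is the product of the per-dimension level counts. The abstract states a codebook of size 45, and one factorization consistent with that is levels (5, 3, 3), since 5 × 3 × 3 = 45; the actual levels used by StyleStream are an assumption here, as is the `fsq_quantize` helper itself.

```python
import numpy as np

# Assumed factorization of the 45-entry codebook: 5 * 3 * 3 = 45.
LEVELS = (5, 3, 3)

def fsq_quantize(z):
    """Minimal FSQ sketch: bound each latent dim and round to a small grid.

    z: array of shape (..., len(LEVELS)) of unbounded latents.
    Returns the quantized latents and their integer codebook indices.
    """
    z = np.tanh(z)                        # squash to (-1, 1)
    levels = np.array(LEVELS)
    half = (levels - 1) / 2.0
    codes = np.round(z * half) + half     # integers in [0, L-1] per dimension
    quantized = codes / half - 1.0        # snap back onto the [-1, 1] grid
    # Combine per-dimension codes into a single mixed-radix codebook index.
    index = np.zeros(codes.shape[:-1])
    for d, L in enumerate(levels):
        index = index * L + codes[..., d]
    return quantized, index.astype(int)

z = np.random.randn(4, 3)
q, idx = fsq_quantize(z)
assert q.shape == (4, 3)
assert ((0 <= idx) & (idx < 45)).all()    # every index falls in the 45-entry codebook
```

In training, the straight-through estimator would pass gradients through the rounding step; this sketch shows only the forward quantization that creates the information bottleneck.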
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10033