Conditional Vocal Timbral Technique Conversion via Embedding-Guided Dual Attribute Modulation

Published: 14 Nov 2025, Last Modified: 14 Nov 2025, Venue: EAIM Oral, License: CC BY 4.0
Keywords: Vocal Timbral Technique, Voice Conversion
TL;DR: We present FABYOL, to our knowledge the first model that performs timbral technique conversion while preserving speaker identity.
Abstract: Vocal timbral techniques such as whisper, falsetto, and vocal fry scream uniquely shape the spectral properties of the human voice, which makes converting between them while preserving the original speaker’s identity a complex challenge. Traditional voice conversion methods, while effective at altering speaker identity or broad timbral qualities, often struggle to transform specialized timbral techniques without compromising speaker-specific traits. Similarly, existing style-transfer models, which are designed to capture broad categories like emotional expressiveness or singing styles, lack the granularity needed to handle technique-specific variations. To address this, we propose FABYOL, a novel framework for timbral technique conversion built upon FACodec. FABYOL leverages supervised contrastive learning to generate embeddings that encode specific timbral techniques. These embeddings are then used to modulate timbre and prosody, enabling authentic technique conversion while preserving speaker identity. Experimental evaluation, using both tailored objective metrics and a user study, indicates that FABYOL achieves promising performance and improves on state-of-the-art models in fidelity and flexibility.
Submission Number: 32