TVTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization

ICLR 2026 Conference Submission 22707 Authors

20 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Time-varying timbre, Streaming voice conversion, Content-synchronous speaker conditioning, Speech anonymization, Vector-quantized bottleneck
TL;DR: A streamable voice conversion/anonymization system that synchronizes time-varying timbre with content via a Global Timbre Memory, improving naturalness and privacy under strict low-latency constraints.
Abstract: Real-time voice conversion and speaker anonymization require causal, low-latency synthesis without sacrificing intelligibility or naturalness. Current systems exhibit a core representational mismatch: content is time-varying, while speaker identity is injected as a static global embedding. We introduce a streamable speech synthesizer that aligns the temporal granularity of identity and content via a content-synchronous, time-varying timbre (TVT) representation. A Global Timbre Memory expands a global timbre instance into multiple compact facets; frame-level content attends to this memory, a gate regulates the degree of variation, and spherical interpolation preserves identity geometry while enabling smooth local changes. In addition, a factorized vector-quantized bottleneck regularizes content to reduce residual speaker leakage. The resulting system is streamable end-to-end, with $<$80 ms GPU latency. Experiments show improvements in naturalness, speaker transfer, and anonymization over state-of-the-art streaming baselines, establishing TVT as a scalable approach for privacy-preserving and expressive speech synthesis under strict latency budgets.
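The abstract describes the TVT conditioning pipeline only at a high level (facet memory, content attention, gating, spherical interpolation). Below is a minimal sketch of how such a module could look, assuming a PyTorch-style implementation; all module names, dimensions, and the exact attention/gating formulation are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of content-synchronous time-varying timbre (TVT) conditioning.
# Assumptions: PyTorch, dot-product attention over facets, a sigmoid gate, and
# slerp between the global and locally attended timbre. The paper's actual
# architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalTimbreMemory(nn.Module):
    """Expands one global timbre embedding into K facets and lets frame-level
    content attend to them, producing a smoothly varying per-frame timbre."""

    def __init__(self, dim: int = 256, num_facets: int = 8):
        super().__init__()
        self.num_facets = num_facets
        self.facet_proj = nn.Linear(dim, num_facets * dim)  # global -> K facets
        self.query_proj = nn.Linear(dim, dim)               # content -> queries
        self.gate_proj = nn.Linear(dim, 1)                  # per-frame variation gate
        self.scale = dim ** -0.5

    @staticmethod
    def slerp(a, b, t, eps: float = 1e-7):
        """Spherical interpolation between unit-normalized a and b with weight t
        in [0, 1]; keeps the result on the identity hypersphere."""
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        cos_omega = (a * b).sum(-1, keepdim=True).clamp(-1 + eps, 1 - eps)
        omega = torch.acos(cos_omega)
        sin_omega = torch.sin(omega)
        return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / sin_omega

    def forward(self, content, spk_embed):
        """content: (B, T, D) frame-level content features (causal stream).
        spk_embed: (B, D) global timbre instance.
        Returns (B, T, D) time-varying timbre aligned with the content frames."""
        B, T, D = content.shape
        facets = self.facet_proj(spk_embed).view(B, self.num_facets, D)    # (B, K, D)
        queries = self.query_proj(content)                                  # (B, T, D)
        attn = torch.softmax(queries @ facets.transpose(1, 2) * self.scale, dim=-1)
        local_timbre = attn @ facets                                        # (B, T, D)
        gate = torch.sigmoid(self.gate_proj(content))                       # (B, T, 1)
        # gate -> 0 falls back to the static global embedding; gate -> 1 uses
        # the fully content-dependent facet mixture.
        global_timbre = spk_embed.unsqueeze(1).expand(B, T, D)
        return self.slerp(global_timbre, local_timbre, gate)


# Usage sketch: 100 content frames conditioned on one speaker embedding.
tvt = GlobalTimbreMemory(dim=256, num_facets=8)
content = torch.randn(2, 100, 256)
speaker = torch.randn(2, 256)
timbre = tvt(content, speaker)  # (2, 100, 256), one timbre vector per frame
```

Using slerp rather than a linear blend keeps every per-frame timbre vector on the same hypersphere as the global embedding, which matches the abstract's claim of preserving identity geometry while allowing smooth local variation.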
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 22707