End-To-End Streaming Model For Low-Latency Speech Anonymization

Published: 01 Jan 2024 · Last Modified: 27 Mar 2025 · SLT 2024 · CC BY-SA 4.0
Abstract: Speaker anonymization aims to conceal cues to speaker identity while preserving linguistic content. Current machine-learning-based approaches require substantial computational resources, hindering real-time streaming applications. To address these concerns, we propose a streaming model that achieves speaker anonymization with low latency. The system is trained in an end-to-end autoencoder fashion using a lightweight content encoder that extracts HuBERT-like information, a pretrained speaker encoder that extracts speaker identity, and a variance encoder that injects pitch and energy information. These three disentangled representations are fed to a decoder that re-synthesizes the speech signal. We present evaluation results from two implementations of our system: a full model that achieves a latency of 230 ms, and a lite version (0.1× the size) that further reduces latency to 66 ms while maintaining state-of-the-art performance in naturalness, intelligibility, and privacy preservation.
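
The abstract describes a three-stream autoencoder: a content encoder, a pretrained speaker encoder, and a variance encoder (pitch and energy), whose disentangled representations are fused and passed to a decoder for re-synthesis. The following is a minimal PyTorch sketch of that layout only, not the authors' code: the module names, layer sizes, additive fusion, and the frame-level decoder stub are all illustrative assumptions, and a real system would replace the stub with a streaming vocoder/decoder that outputs a waveform.

```python
# Hypothetical sketch of the three-stream autoencoder layout described in the abstract.
# All names, dimensions, and the additive fusion scheme are assumptions for illustration.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Lightweight encoder producing HuBERT-like content frames (assumed conv stack)."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
    def forward(self, mel):                  # mel: (B, n_mels, T)
        return self.net(mel)                 # (B, dim, T)

class VarianceEncoder(nn.Module):
    """Projects frame-level pitch and energy into the shared decoder dimension."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Conv1d(2, dim, kernel_size=1)
    def forward(self, f0, energy):           # each: (B, T)
        return self.proj(torch.stack([f0, energy], dim=1))  # (B, dim, T)

class Decoder(nn.Module):
    """Stub decoder over the fused streams; a unidirectional GRU keeps it streamable.
    Emits frame-level features here; the real system re-synthesizes a waveform."""
    def __init__(self, dim=256, out_dim=80):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, out_dim)
    def forward(self, x):                    # x: (B, dim, T)
        h, _ = self.gru(x.transpose(1, 2))
        return self.out(h).transpose(1, 2)   # (B, out_dim, T)

class StreamingAnonymizer(nn.Module):
    """End-to-end autoencoder: content + variance + (pretrained) speaker embedding -> decoder."""
    def __init__(self, dim=256, spk_dim=192):
        super().__init__()
        self.content = ContentEncoder(dim=dim)
        self.variance = VarianceEncoder(dim=dim)
        self.spk_proj = nn.Linear(spk_dim, dim)   # embedding from a pretrained speaker encoder
        self.decoder = Decoder(dim=dim)
    def forward(self, mel, f0, energy, spk_emb):
        c = self.content(mel)                                 # linguistic content
        v = self.variance(f0, energy)                         # pitch + energy
        s = self.spk_proj(spk_emb).unsqueeze(-1)              # (B, dim, 1), broadcast over time
        return self.decoder(c + v + s)                        # fused, then re-synthesized

# Toy usage: anonymization would substitute spk_emb with a pseudo-speaker embedding.
model = StreamingAnonymizer()
mel = torch.randn(1, 80, 100)
f0, energy = torch.randn(1, 100), torch.randn(1, 100)
out = model(mel, f0, energy, torch.randn(1, 192))
print(out.shape)  # torch.Size([1, 80, 100])
```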