Universal Semantic Disentangled Privacy-preserving Speech Representation Learning

Biel Tura Vecino; Subhadeep Maji; Aravind Varier; Antonio Bonafonte; Ivan Valles-Perez; Michael Owen; LEIF RÄDEL; Grant Strimel; Oluwaseyi Feyisetan; Roberto Barra-Chicote; Ariya Rastrow; Constantinos Papayiannis; Volker Leutnant; Trevor Wood

Submitted to ICLR 2025 on 24 Sept 2024 (modified: 05 Feb 2025). License: CC BY 4.0

Keywords: speech, tokenization, disentanglement, large-language models, LLM, privacy, secure

TL;DR: A speech representation learning method that enables training of privacy-preserving generative speech models by disentangling speech into semantic representations that capture content and paralinguistics while removing identifiable speaker information.

Abstract: The use of audio recordings of human speech to train LLMs poses privacy concerns due to these models' potential to generate outputs that closely resemble artifacts in the training data. In this study, we propose a speaker privacy-preserving representation learning method through the Universal Speech Codec (USC), a computationally efficient encoder-decoder model that disentangles speech into: (i) privacy-preserving, semantically rich representations that capture content and speech paralinguistics, and (ii) residual acoustic and speaker representations that enable high-fidelity reconstruction. Extensive evaluations show that USC's semantic representation preserves content, prosody, and sentiment, while removing potentially identifiable speaker attributes. Combining both representations, USC achieves state-of-the-art speech reconstruction. Additionally, we introduce an evaluation methodology for measuring privacy-preserving properties that aligns with perceptual tests. We compare USC against other codecs in the literature and demonstrate its effectiveness for privacy-preserving representation learning, illustrating the trade-offs among speaker anonymization, paralinguistics retention, and content preservation in the learned semantic representations.
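The split described in the abstract, a semantic representation plus a residual representation that together allow reconstruction, can be illustrated with a toy two-stage quantizer. This is only a sketch under stated assumptions: the abstract does not say how USC quantizes or disentangles speech, so the residual-quantization scheme, the tiny hand-written codebooks, and the names `nearest`, `encode`, `decode`, `SEMANTIC_CB`, and `RESIDUAL_CB` are all hypothetical. No privacy mechanism is modeled here; the sketch only shows how a coarse code and a residual code combine.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)),
    )


def encode(frame: Vector, semantic_cb: Sequence[Vector],
           residual_cb: Sequence[Vector]) -> Tuple[int, int]:
    """Quantize a frame into a coarse code, then quantize what is left over."""
    s = nearest(semantic_cb, frame)
    # Residual: the part of the frame not explained by the coarse codeword.
    residual = [a - b for a, b in zip(frame, semantic_cb[s])]
    r = nearest(residual_cb, residual)
    return s, r


def decode(s: int, r: int, semantic_cb: Sequence[Vector],
           residual_cb: Sequence[Vector]) -> List[float]:
    """Reconstruct a frame by summing the coarse and residual codewords."""
    return [a + b for a, b in zip(semantic_cb[s], residual_cb[r])]


# Hypothetical codebooks, chosen by hand for illustration only.
SEMANTIC_CB = [(0.0, 0.0), (1.0, 1.0)]
RESIDUAL_CB = [(0.0, 0.0), (0.25, -0.25), (-0.25, 0.25)]

frame = (1.25, 0.75)
s, r = encode(frame, SEMANTIC_CB, RESIDUAL_CB)
coarse_only = list(SEMANTIC_CB[s])                 # coarse stream alone
recon = decode(s, r, SEMANTIC_CB, RESIDUAL_CB)     # both streams combined
```

In this toy setting the coarse code alone gives an approximate frame, and adding the residual code recovers the input exactly. In the system the abstract describes, the semantic stream is additionally trained to retain content and paralinguistics while dropping speaker identity, which this sketch does not attempt.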

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 3959
