DP-VAE: Human-Readable Text Anonymization for Online Reviews with Differentially Private Variational Autoencoders

Published: 01 Jan 2022, Last Modified: 26 Jan 2025WWW 2022EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: While vast amounts of personal data are shared daily on public online platforms and used by companies and analysts to gain valuable insights, privacy concerns are also on the rise: Modern authorship attribution techniques have proven effective at identifying individuals from their data, such as their writing style or behavior of picking and judging movies. It is hence crucial to develop data sanitization methods that allow sharing of users’ data while protecting their privacy and preserving quality and content of the original data.In this paper, we tackle anonymization of textual data and propose an end-to-end differentially private variational autoencoder architecture. Unlike previous approaches that achieve differential privacy on a per-word level through individual perturbations, our solution works at an abstract level by perturbing the latent vectors that provide a global summary of the input texts. Decoding an obfuscated latent vector thus not only allows our model to produce coherent, high-quality output text that is human-readable, but also results in strong anonymization due to the diversity of the produced data. We evaluate our approach on IMDb movie and Yelp business reviews, confirming its anonymization capabilities and preservation of the semantics and utility of the original sentences.
Loading

OpenReview is a long-term project to advance science through improved peer review with legal nonprofit status. We gratefully acknowledge the support of the OpenReview Sponsors. © 2025 OpenReview