Text Embeddings Reveal (Almost) As Much As Text

John Xavier Morris; Volodymyr Kuleshov; Vitaly Shmatikov; Alexander M Rush

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Main

Submission Type: Regular Long Paper

Submission Track: NLP Applications

Submission Track 2: Information Retrieval and Text Mining

Keywords: text embeddings, text retrieval, privacy, inversion, leakage attack

TL;DR: We show that text can be reconstructed from its embeddings, and that such reconstruction can seriously violate user privacy.

Abstract: How much private information do text embeddings reveal about the original text? We investigate the problem of embedding inversion: reconstructing the full text represented in dense text embeddings. We frame the problem as controlled generation: generating text that, when re-embedded, is close to a fixed point in latent space. We find that although a naive model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly. We train our model to decode text embeddings from two state-of-the-art embedding models, and also show that our model can recover important personal information (full names) from a dataset of clinical notes.
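
The correct-and-re-embed loop described in the abstract can be illustrated with a toy sketch. This is not the authors' method: the paper trains a model to propose corrections, whereas the sketch below substitutes a greedy token search, and the embedder is a hypothetical position-weighted bag of words rather than a real embedding model. The names VOCAB, embed, cosine, and invert are all assumptions introduced for illustration. It only shows the shape of the loop: propose a hypothesis, re-embed it, and keep it if it moves closer to the target embedding.

```python
import math

# Hypothetical toy vocabulary; a real attack would use the embedder's tokenizer.
VOCAB = ["the", "patient", "john", "smith", "was",
         "admitted", "today", "mary", "discharged"]


def embed(tokens):
    """Toy stand-in embedder: position-weighted bag of words, L2-normalized."""
    vec = [0.0] * len(VOCAB)
    for pos, tok in enumerate(tokens):
        vec[VOCAB.index(tok)] += 1.0 + 0.1 * pos
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def invert(target, length, max_rounds=10):
    """Greedy correct-and-re-embed loop (an assumption, not the paper's model).

    Starts from a fixed hypothesis, tries single-token substitutions,
    re-embeds each candidate, and keeps any that is closer to the target.
    """
    hypothesis = [VOCAB[0]] * length
    best = cosine(embed(hypothesis), target)
    for _ in range(max_rounds):
        improved = False
        for i in range(length):
            for tok in VOCAB:
                candidate = hypothesis[:i] + [tok] + hypothesis[i + 1:]
                score = cosine(embed(candidate), target)
                if score > best + 1e-12:
                    hypothesis, best, improved = candidate, score, True
        if not improved:
            break  # no single substitution helps; the search has stalled
    return hypothesis, best


# The attacker sees only `target`, never `secret`.
secret = ["john", "smith", "was", "admitted", "today"]
target = embed(secret)
recovered, score = invert(target, len(secret))
print(recovered, round(score, 3))
```

A greedy search like this can stall in a local optimum and is not guaranteed to recover the input exactly; the abstract's point is that a trained corrector applied over several steps does far better than a single-shot decoder.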

Submission Number: 2068
