Abstract: Anonymizing text that contains sensitive information is crucial for a wide range of applications. Existing techniques face an emerging challenge from the re-identification capability of large language models (LLMs), which can memorize detailed information and reason over dispersed pieces of evidence to draw conclusions. When defending against LLM-based re-identification, anonymization can jeopardize the utility of the resulting data in downstream tasks. The interplay between anonymization and data utility therefore requires a deeper understanding in the context of LLMs. In this paper, we propose a framework composed of three key LLM-based components, $\textit{a privacy evaluator}$, $\textit{a utility evaluator}$, and $\textit{an optimization component}$, which work collaboratively to perform anonymization. Extensive experiments demonstrate that the proposed model outperforms existing baselines, robustly reducing the risk of re-identification while preserving greater data utility in downstream tasks. We provide detailed studies of these core modules. To support large-scale and real-time applications, we also investigate distilling the anonymization capability into lightweight models. All of our code and datasets will be made publicly available at $\texttt{[Github URL]}$.
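The abstract describes a feedback loop between the three components but does not specify how they interact. The following minimal Python sketch illustrates one plausible reading of that architecture; every name (`iterative_anonymize`, `privacy_score`, `utility_score`, `rewrite`) and the scoring logic are hypothetical stand-ins for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of the three-component anonymization loop.
# All names and thresholds are illustrative assumptions, not the paper's API.
from typing import Callable, Optional

def iterative_anonymize(
    text: str,
    rewrite: Callable[[str, str], str],      # optimization component: LLM rewriter
    privacy_score: Callable[[str], float],   # privacy evaluator: re-identification risk in [0, 1]
    utility_score: Callable[[str], float],   # utility evaluator: downstream utility in [0, 1]
    risk_threshold: float = 0.2,
    max_rounds: int = 5,
) -> str:
    """Iteratively rewrite `text`, keeping the highest-utility candidate
    whose estimated re-identification risk falls below the threshold."""
    best: Optional[str] = None
    best_utility = -1.0
    candidate = text
    for _ in range(max_rounds):
        risk = privacy_score(candidate)
        if risk <= risk_threshold:
            utility = utility_score(candidate)
            if utility > best_utility:
                best, best_utility = candidate, utility
        # Feed the evaluators' signals back to the rewriter as textual feedback.
        feedback = f"re-identification risk={risk:.2f}; reduce it while preserving meaning"
        candidate = rewrite(candidate, feedback)
    return best if best is not None else candidate

# Toy stand-ins for demonstration only; real components would be LLM calls.
masked = iterative_anonymize(
    "Jane Doe, 34, works at Acme Corp in Berlin.",
    rewrite=lambda t, fb: t.replace("Jane Doe", "the individual"),
    privacy_score=lambda t: 0.9 if "Jane Doe" in t else 0.1,
    utility_score=lambda t: len(t) / 100,
)
print(masked)
```

Under this reading, the two evaluators only score candidates, while the optimization component consumes their feedback to propose the next rewrite, so anonymization quality and utility are traded off explicitly at each round.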
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Text anonymization, Security and privacy, Large language model
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 1152