Unique between the lines: benchmarking re-identification risk for text anonymization

ICLR 2026 Conference Submission20454 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: anonymization, re-identification, privacy, large language models
TL;DR: We introduce a benchmark that evaluates text anonymizers by re-identification risk, using synthetic but realistic texts seeded with U.S. demographics.
Abstract: Data containing sensitive personal information is increasingly used to train, fine-tune, or query Large Language Models (LLMs), raising the risk that such data may be inadvertently leaked. Text is typically scrubbed of identifying information prior to use, often with tools such as Microsoft's Presidio or Anthropic's PII purifier. These tools are generally evaluated on their ability to remove manually annotated identifiers (e.g., names), yet their effectiveness at preventing re-identification remains unclear. We introduce, to the best of our knowledge, the first modern, synthetic benchmark for measuring how well text anonymization tools prevent re-identification. We use U.S. demographic statistics to generate synthetic yet realistic texts that contain various direct and indirect identifiers across diverse domains and difficulty levels. We apply a range of NER- and LLM-based text anonymization tools to our benchmark and, based on the attributes an LLM-based attacker is able to infer correctly from the anonymized text, we report the risk that an individual will be correctly re-identified within the U.S. population. Our results show that existing tools still often miss direct identifiers or leave enough indirect information for successful re-identification: even the best anonymizer leaves a significant re-identification risk of 36% in our setup. We conduct ablations over the number and type of attributes and also study the utility and cost of anonymization. We find that NER-based methods can reduce re-identification risk substantially, albeit sometimes at a steep cost in utility, while LLM-based tools remove identifiable information more precisely but at higher computational cost. We will release the benchmark and encourage community efforts to expand it, so that it remains a robust test as anonymization tools improve.
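
The abstract does not spell out how the population-level risk score is computed. As a minimal sketch, assuming risk is scored as 1/k over the anonymity set of the attacker-inferred quasi-identifiers (the function name, attribute tuples, and toy population below are illustrative assumptions, not the paper's exact procedure), the idea could look like this:

```python
from collections import Counter

def reidentification_risk(inferred_attrs, population):
    """Estimate average re-identification risk as 1/k per record, where k
    is the anonymity-set size: the number of people in the population who
    share the attacker-inferred quasi-identifier combination.

    inferred_attrs: one tuple per anonymized text, e.g. (age_bracket,
        sex, zip3), as inferred by the LLM-based attacker.
    population: one tuple per person (or expanded from census counts),
        standing in for U.S. demographic statistics.
    """
    anonymity_sets = Counter(population)
    risks = []
    for attrs in inferred_attrs:
        k = anonymity_sets.get(attrs, 0)
        # If no one in the population matches, the inference was wrong
        # and carries no re-identification risk under this metric.
        risks.append(1.0 / k if k > 0 else 0.0)
    return sum(risks) / len(risks)

# Toy usage: two inferred profiles against a tiny synthetic population.
population = [("30-39", "F", "941")] * 3 + [("60-69", "M", "100")]
inferred = [("30-39", "F", "941"), ("60-69", "M", "100")]
print(reidentification_risk(inferred, population))  # (1/3 + 1/1) / 2 = 0.67
```

Under this reading, a unique attribute combination (k = 1) yields certain re-identification, which matches the intuition that leftover indirect identifiers, not just missed names, drive the reported 36% residual risk.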
Primary Area: datasets and benchmarks
Submission Number: 20454