NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human

ACL ARR 2024 June Submission 2305 Authors

15 Jun 2024 (modified: 02 Jul 2024) · CC BY 4.0
Abstract: Concerns about privacy leakage in academia and industry are growing as NLP models from third-party providers are increasingly used to process sensitive texts. To protect privacy before sensitive data is sent to such models, we propose sanitizing the text using two strategies commonly employed by humans: i) deleting sensitive expressions, and ii) obscuring sensitive details by abstracting them. To study these strategies and develop a tool for text rewriting, we curate the first corpus of its kind, coined NAP^2, through both crowdsourcing and the use of large language models (LLMs). In contrast to prior work based on differential privacy, which causes a sharp drop in information utility and produces unnatural text, the human-inspired approaches yield more natural rewrites and strike a better balance between privacy protection and data utility, as demonstrated by our extensive experiments. Our dataset is available at https://anonymous.4open.science/r/NAP-2-benchmark-for-privacy-aware-rewriting-59F4/
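The two human-inspired sanitization strategies named in the abstract can be sketched as follows. This is a hypothetical toy illustration, not the paper's actual method: the sensitive-span lookup table and both helper functions are assumptions for demonstration, whereas a real system would detect sensitive spans with NER models or LLMs.

```python
import re

# Hypothetical toy lookup of sensitive spans and their abstractions.
# A real pipeline would detect these spans with NER or an LLM.
SENSITIVE = {
    "John Smith": "a person",  # personal name -> generic category
    "Berlin": "a city",        # location -> generic category
}

def sanitize_delete(text: str) -> str:
    """Strategy i): delete sensitive expressions outright."""
    for span in SENSITIVE:
        text = text.replace(span, "")
    # Collapse the whitespace left behind by deletions.
    return re.sub(r"\s{2,}", " ", text).strip()

def sanitize_abstract(text: str) -> str:
    """Strategy ii): obscure sensitive details by abstracting them."""
    for span, abstraction in SENSITIVE.items():
        text = text.replace(span, abstraction)
    return text

msg = "John Smith moved to Berlin last year."
print(sanitize_delete(msg))    # deletion removes the spans entirely
print(sanitize_abstract(msg))  # abstraction keeps the sentence natural
```

The abstraction variant illustrates the paper's motivation: unlike deletion (or differential-privacy perturbation), it preserves a fluent, natural sentence while hiding the sensitive specifics.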
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking;security/privacy;NLP datasets;automatic evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 2305