NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human

ACL ARR 2024 June Submission 2305 Authors

15 Jun 2024 (modified: 02 Jul 2024) · CC BY 4.0
Abstract: Concerns about privacy leakage in academia and industry are growing as NLP models from third-party providers are increasingly used to process sensitive texts. To protect privacy before sensitive data is sent to such models, we propose sanitizing the text using two strategies commonly employed by humans: i) deleting sensitive expressions, and ii) obscuring sensitive details by abstracting them. To study these strategies and develop a tool for text rewriting, we curate the first corpus of its kind, coined NAP^2, through both crowdsourcing and the use of large language models (LLMs). In contrast to prior work based on differential privacy, which causes a sharp drop in information utility and produces unnatural text, the human-inspired approaches yield more natural rewrites and strike a better balance between privacy protection and data utility, as demonstrated by our extensive experiments. Our dataset is available at https://anonymous.4open.science/r/NAP-2-benchmark-for-privacy-aware-rewriting-59F4/
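The two human-inspired sanitization strategies named in the abstract can be sketched as follows. This is a hypothetical toy illustration, not the paper's actual method: the sensitive-span lookup table and both helper functions are assumptions for demonstration, whereas a real system would detect sensitive spans with NER models or LLMs.

```python
import re

# Hypothetical toy lookup of sensitive spans and their abstractions.
# A real pipeline would detect these spans with NER or an LLM.
SENSITIVE = {
    "John Smith": "a person",  # personal name -> generic category
    "Berlin": "a city",        # location -> generic category
}

def sanitize_delete(text: str) -> str:
    """Strategy i): delete sensitive expressions outright."""
    for span in SENSITIVE:
        text = text.replace(span, "")
    # Collapse the whitespace left behind by deletions.
    return re.sub(r"\s{2,}", " ", text).strip()

def sanitize_abstract(text: str) -> str:
    """Strategy ii): obscure sensitive details by abstracting them."""
    for span, abstraction in SENSITIVE.items():
        text = text.replace(span, abstraction)
    return text

msg = "John Smith moved to Berlin last year."
print(sanitize_delete(msg))    # deletion removes the spans entirely
print(sanitize_abstract(msg))  # abstraction keeps the sentence natural
```

The abstraction variant illustrates the paper's motivation: unlike deletion (or differential-privacy perturbation), it preserves a fluent, natural sentence while hiding the sensitive specifics.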
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking;security/privacy;NLP datasets;automatic evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 2305