Generating and Evaluating Synthetic Data for Privacy Preservation in High-Stakes Domains

ACL ARR 2024 June Submission1645 Authors

14 Jun 2024 (modified: 08 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: The difficulty of anonymizing text data hinders the development and deployment of NLP in high-stakes domains that involve private data, such as healthcare and social services. Poorly anonymized sensitive data cannot be easily shared with annotators or external researchers, nor can it be used to train public models. In this work, we develop methods to generate and evaluate synthetic data to facilitate the development of NLP in these domains without compromising privacy. We use language models fine-tuned with differential privacy to generate data and incorporate NLI-based filtering to improve text coherence. In contrast to prior work, we generate and evaluate data for fine-grained applications in real high-stakes domains. Our results show that prior, simplistic evaluations have failed to surface utility, privacy, and fairness issues in the generated synthetic data. While NLI-based filtering can help alleviate some of these weaknesses, the quality of the generated data still needs further improvement.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: security and privacy, evaluation methodologies, data-to-text generation
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 1645