Generating and Evaluating Synthetic Data for Privacy Preservation in High-Stakes Domains

Abstract: The difficulty of anonymizing text data hinders the development and deployment of NLP in high-stakes domains that involve private data, such as healthcare and social services. Poorly anonymized sensitive data cannot be easily shared with annotators or external researchers, nor can it be used to train public models. In this work, we develop methods to generate and evaluate synthetic data to facilitate the development of NLP in these domains without compromising privacy. We generate data with language models fine-tuned under differential privacy and apply filtering based on natural language inference (NLI) to improve text coherence. In contrast to prior work, we generate and evaluate data for fine-grained applications in real high-stakes domains. Our results show that prior simplistic evaluations have failed to surface utility, privacy, and fairness issues in the generated synthetic data. While NLI-based filtering helps alleviate some of these weaknesses, the quality of the generated data still needs further improvement.
