Keywords: Differential Privacy; Synthetic Data; Clinical NLP; Text Generation
Abstract: In high-stakes domains such as healthcare, privacy concerns severely limit the use of real-world training data. Differentially private (DP) synthetic data offers a promising alternative with formal privacy guarantees, but achieving strong utility remains challenging for clinical note generation due to domain specificity and long-form text complexity. We present Term2Note, a method for synthesising full-length clinical notes under strong DP constraints. By structurally separating content and form, Term2Note generates section-wise note content conditioned on medical terms, with terms and notes privatised under separate DP constraints, and applies a DP quality maximiser to improve outputs. Experiments demonstrate that Term2Note produces synthetic notes with statistical properties closely aligned with real clinical notes, and that downstream models trained on these notes achieve performance comparable to those trained on real clinical data. Compared to existing DP text generation baselines, Term2Note substantially improves both fidelity and utility, without relying on label distribution assumptions, highlighting its effectiveness as a practical privacy-preserving alternative to real clinical notes.
Paper Type: Long
Research Area: Clinical and Biomedical Applications
Research Area Keywords: clinical NLP; security/privacy
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 4014