Abstract: The scarcity of publicly available clinical corpora hinders developing and applying NLP tools in clinical research.
While existing work tackles this issue by using generative models to create high-quality synthetic corpora, these methods require learning from the original in-hospital clinical documents, which makes them infeasible in practice.
To address this problem, we introduce RecordTwin, a novel synthetic corpus creation method designed to generate synthetic documents from anonymized clinical entities.
In this method, we first extract and anonymize entities from in-hospital documents, which restricts the information that the synthetic corpus can contain.
Then, we use a large language model to fill the context between anonymized entities.
To do so, we use a small, privacy-preserving subset of the original documents to mimic their formatting and writing style.
This approach only requires anonymized entities and a small subset of original documents in the generation process, making it more feasible in practice.
To evaluate the synthetic corpus created with our method, we conduct a proof-of-concept study using a publicly available clinical database.
Our results demonstrate that the synthetic corpus has utility comparable to that of the original data and a safety advantage over baselines, highlighting the potential of RecordTwin for privacy-preserving synthetic corpus creation.
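The generation steps described in the abstract (extract and anonymize entities, then ask a language model to fill the context between them in the style of a few sample documents) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon, the identifier patterns, the prompt wording, and every function name below are assumptions made for illustration, and the language model call itself is left out.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    text: str   # surface form of the clinical entity
    label: str  # entity type, e.g. PROBLEM, TEST, TREATMENT


# Hypothetical toy lexicon standing in for a real clinical entity extractor.
TOY_LEXICON = {
    "pneumonia": "PROBLEM",
    "chest x-ray": "TEST",
    "ceftriaxone": "TREATMENT",
}

# Hypothetical identifier patterns; real anonymization needs far broader coverage.
IDENTIFIER_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "[DATE]"),
    (re.compile(r"\bMRN\s*\d+\b"), "[ID]"),
]


def anonymize(text: str) -> str:
    """Replace identifier-like spans with placeholders."""
    for pattern, placeholder in IDENTIFIER_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def extract_entities(document: str) -> list[Entity]:
    """Return anonymized lexicon entities in order of first appearance."""
    lowered = document.lower()
    found = []
    for term, label in TOY_LEXICON.items():
        position = lowered.find(term)
        if position != -1:
            found.append((position, Entity(anonymize(term), label)))
    found.sort(key=lambda pair: pair[0])
    return [entity for _, entity in found]


def build_prompt(entities: list[Entity], style_examples: list[str]) -> str:
    """Assemble a prompt from entities and style samples only.

    The source document is never passed in, so the prompt can carry
    nothing beyond the anonymized entities and the style samples.
    """
    samples = "\n---\n".join(anonymize(example) for example in style_examples)
    entity_lines = "\n".join(f"- {e.text} ({e.label})" for e in entities)
    return (
        "Write a clinical note in the style of the samples below.\n"
        "Mention every listed entity, in the given order.\n\n"
        f"Style samples:\n{samples}\n\n"
        f"Entities:\n{entity_lines}\n"
    )
```

In this sketch the prompt returned by `build_prompt` would be sent to whichever language model is available. Keeping the source document out of `build_prompt` mirrors the abstract's point that generation only needs the anonymized entities and a small set of style samples.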
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: healthcare applications, clinical NLP
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 3340