Keywords: Synthetic Data Generation, Prompting, Large Language Models, Clinical NLP
TL;DR: We propose a generally applicable knowledge-informed prompting method that improves the quality of synthetic data generated by LLMs.
Abstract: Clinical natural language processing requires methods that can address domain-specific challenges, such as complex medical terminology and clinical contexts.
Recently, large language models (LLMs) have shown promise in this domain. Yet their direct deployment can raise privacy issues and is constrained by resources.
To address these challenges, we propose ClinGen, which infuses knowledge into synthetic clinical text generation using LLMs for clinical NLP tasks. Our approach involves clinical knowledge extraction and context-informed LLM prompting: both clinical topics and writing styles are drawn from external domain-specific knowledge graphs and LLMs to guide data generation.
Extensive studies across 7 clinical NLP tasks and 16 datasets reveal that ClinGen consistently improves performance across tasks, aligning the generated data more closely with the distribution of real datasets and enriching the diversity of generated training instances.
Supplementary Material: zip
Submission Number: 85