Retrieval-Reasoning Large Language Model-based Synthetic Clinical Trial Generation
Keywords: Synthetic Data Generation, Large Language Model, Clinical NLP
TL;DR: We introduce a novel Retrieval-Reasoning few-shot framework that uses LLMs to generate synthetic clinical trials with binary success/failure labels.
Abstract: Machine learning (ML) has exhibited considerable promise in the clinical domain. However, its capabilities are constrained by data scarcity and ethical considerations, as the generation of clinical trials presents significant challenges due to stringent privacy regulations, high costs, and the extended duration required for conducting studies with human participants. Despite the advancements of large language models (LLMs) in natural language understanding and generation, limited research has explored their potential in facilitating the generation of synthetic clinical trials. To address this gap, we introduce a novel Retrieval-Reasoning few-shot framework that leverages LLMs to generate artificial yet realistic and diverse clinical trials with binary success/failure labels. Extensive experiments conducted on real clinical trials from the ClinicalTrials.gov database demonstrate that our generated synthetic data can effectively augment real datasets. Furthermore, by fine-tuning a pre-trained model as a binary classifier on synthetic clinical trial datasets, we demonstrate that this augmentation enhances model training for downstream tasks such as trial outcome prediction. Our findings suggest that leveraging LLMs for synthetic clinical trial generation holds significant promise for accelerating clinical research, enabling more effective ML models in healthcare, and upholding ethical standards for patient privacy.
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13018