Abstract: Data scarcity presents a significant challenge in developing medical large language models (LLMs), especially in specialized fields like anesthesiology. Leveraging advanced LLMs like ChatGPT to synthesize high-quality Q&A data is a promising way to address this scarcity. However, limited research exists on methods for producing such datasets specifically for anesthesiology, despite their importance for training robust LLMs. In this paper, we explore a compositional approach to generating anesthesia data for fine-tuning medical LLMs, which we refer to as CDGen. Specifically, we first introduce a compositional augmentation strategy to enhance the diversity and reliability of collected medical records. We then employ a self-talk approach to generate anesthesia Q&A data from the augmented medical records. Extensive experiments using common metrics and GPT-4 evaluation demonstrate that the proposed approach can synthesize high-quality Q&A data and effectively improve the performance of LLMs on anesthesia-related Q&A tasks.
External IDs: dblp:journals/tetci/LiZYWZZZZDLWT25