Keywords: Clinical NLP, Corpus Linguistics, LLM-as-a-judge, German, Data Augmentation, Synthetic Data
Abstract: Text corpora in non-English clinical contexts are sparse, making synthetic data generation with Large Language Models (LLMs) a promising strategy to overcome this data gap. To test the quality of LLM-generated synthetic data, we applied a cohort of models to our novel German Medical Interview Questions Corpus (GerMedIQ), consisting of 4,524 unique question-response pairs in German, and augmented the corpus by asking each model to produce suitable responses to the same questions. Structural and semantic evaluations of the synthetic responses revealed that, although the augmented responses may meet grammatical requirements, most models were unable to produce responses semantically comparable to those of humans. In addition, an LLM-as-a-judge experiment showed that human responses were consistently rated as more appropriate than synthetic ones. We conclude that data augmentation with LLMs in non-English clinical contexts must be performed carefully.
Archival Status: Archival
Paper Length: Long Paper (up to 8 pages of content)
Submission Number: 322