Keywords: question generation, electronic health records, natural language processing
TL;DR: We introduce a scalable framework that uses LLMs and clinician verification to generate clinically useful questions from patient records, enabling large-scale evaluation of clinical question-answer systems.
Track: Findings
Abstract: Evaluating question-answer systems for electronic health records is challenging due to the high cost of annotation, limiting the realism and scale of existing benchmarks. In this work, we introduce a scalable framework that pairs large language model generation with clinician verification to automatically produce questions that evaluate information retrieval over longitudinal records. The framework leverages patient timelines to generate questions that emulate those asked during chart review. We compare a generation approach that uses a single History & Physical (H&P) note against one that supplements the H&P with patient facts. Physicians approved 93% of questions generated from the H&P with patient facts, a 7% increase over using the H&P alone. Incorporating facts into the generation process yielded a 4% increase in verifiable questions and a 30% increase in multi-hop questions, the most clinically useful type, which synthesize information across multiple encounters. Our findings demonstrate the utility of our framework for supporting meaningful, large-scale evaluation of clinical question-answer systems.
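As a rough illustration of the generation setup the abstract describes (not the authors' released code), the sketch below assembles a question-generation prompt from an H&P note plus structured patient facts. The `call_llm` stub, the prompt wording, and the `PatientFact` format are all hypothetical placeholders, not details from the paper.

```python
from dataclasses import dataclass

@dataclass
class PatientFact:
    date: str       # encounter date, e.g. "2021-03-04"
    statement: str  # atomic fact extracted from the longitudinal record

# Hypothetical prompt; the paper's actual instructions are not shown here.
PROMPT_TEMPLATE = """You are reviewing a patient's chart.
H&P note:
{note}

Additional patient facts from the longitudinal record:
{facts}

Write questions a clinician might ask during chart review, including
multi-hop questions that combine facts from multiple encounters.
Each question must be verifiable from the record above."""

def build_prompt(hp_note: str, facts: list[PatientFact]) -> str:
    """Combine the H&P note with timeline facts into one prompt.

    Passing an empty fact list corresponds to the H&P-only condition
    compared in the abstract.
    """
    fact_lines = "\n".join(f"- [{f.date}] {f.statement}" for f in facts)
    return PROMPT_TEMPLATE.format(note=hp_note, facts=fact_lines or "(none)")

def call_llm(prompt: str) -> list[str]:
    """Placeholder for the model call; returns a canned question here.

    In a real pipeline this would query an LLM, and the returned
    questions would then go to physicians for verification.
    """
    return ["What medication change followed the 2021-03-04 admission?"]

if __name__ == "__main__":
    facts = [PatientFact("2021-03-04", "Admitted for acute heart failure.")]
    questions = call_llm(build_prompt("65M with dyspnea on exertion...", facts))
    print(questions)
```

The two generation conditions compared in the abstract would map onto calling `build_prompt` with and without the fact list; clinician verification sits downstream of `call_llm` and is not modeled here.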
General Area: Applications and Practice
Specific Subject Areas: Dataset Release & Characterization, Natural Language Processing
Data And Code Availability: Yes
Ethics Board Approval: Yes
Entered Conflicts: I confirm the above
Anonymity: I confirm the above
Submission Number: 238