Limitations of Automated Reflection Consolidation in LLMs for Clinical Note Extraction: Evidence for Human-in-the-Loop Requirements
Confirmation: I have read and agree with the IEEE BHI 2025 conference submission's policy on behalf of myself and my co-authors.
Keywords: Stroke Surgery, Large language models, Human-in-the-loop, Information extraction, Prompt engineering
TL;DR: LLMs improve clinical extraction with self-reflection, but generalized corrections often lose key details. Human oversight is needed for reliable prompt optimization in clinical tasks.
Abstract: Large language models (LLMs) are increasingly used to extract clinical information from unstructured text, but systematic methods for optimizing their performance in specialized medical domains are not well established. Reflection, a prompt-based approach to improving performance, has shown promise in various settings, but it is usually applied to single instances rather than generalized into reusable solutions. This study explores whether open-source LLMs can consolidate multiple successful reflections into reusable prompt components for clinical text extraction tasks. We tested six LLMs on clinical notes describing endovascular thrombectomy procedures, with each model extracting seven key variables. For incorrect outputs, up to five rounds of self-reflection were triggered, and three strategies for guiding these reflections were compared. Corrective reflections were then consolidated into generalized prompt components. While all models improved on individual instances following reflection, consolidation yielded mixed results: some models showed modest overall improvement, while others did not. Summarizing effective reflections often discarded essential details, limiting the benefit of prompt consolidation. These findings suggest that while reflection aids self-correction, autonomous generalization remains challenging, and they argue for structured human-in-the-loop oversight during the consolidation phase to preserve critical clinical information in prompt optimization.
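To make the workflow concrete, the sketch below illustrates the reflection-then-consolidation loop described in the abstract. It is a minimal illustration, not the authors' implementation: the function names (call_llm, extract_with_reflection, consolidate) and all prompt wording are hypothetical placeholders, and only the limit of five reflection rounds is taken from the abstract.

```python
# Hypothetical sketch of instance-level self-reflection followed by
# consolidation into a reusable prompt component.

MAX_REFLECTIONS = 5  # abstract: up to five rounds of self-reflection


def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to an open-source LLM."""
    raise NotImplementedError("wire this to the model under test")


def extract_with_reflection(note: str, variable: str, gold: str,
                            base_prompt: str) -> tuple[str, list[str]]:
    """Extract one variable; on error, ask the model to reflect and retry."""
    prompt = f"{base_prompt}\n\nNote:\n{note}\n\nExtract: {variable}"
    answer = call_llm(prompt)
    reflections: list[str] = []
    for _ in range(MAX_REFLECTIONS):
        if answer.strip() == gold.strip():  # instance-level correctness check
            break
        critique = call_llm(
            f"Your answer '{answer}' for '{variable}' was incorrect. "
            "Explain the mistake and state a rule that would prevent it."
        )
        reflections.append(critique)
        answer = call_llm(f"{prompt}\n\nGuidance:\n{critique}")
    return answer, reflections


def consolidate(reflections: list[str]) -> str:
    """Merge successful corrective reflections into one reusable prompt
    component; per the abstract, this summarization step is where essential
    details were often lost."""
    joined = "\n- ".join(reflections)
    return call_llm(
        "Merge the following corrective notes into a short, general "
        f"instruction for future extractions:\n- {joined}"
    )
```

In this sketch, the consolidation step is a single summarization call, which mirrors why human-in-the-loop review of the consolidated component is proposed before it is reused.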
Track: 4. Clinical Informatics
Registration Id: DTNFJT65QVV
Submission Number: 110