Optimized Large Language Models Accurately Identify Recurrence of VT After Ablation from Complex Medical Notes: Will Chart Review Become Obsolete?

Published: 21 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Large Language Models, Natural Language Processing, Prompt Engineering, Electronic Health Records, Ventricular Arrhythmias
TL;DR: We propose a non-domain-specific prompting strategy, Structured Rationale Responses, that enhances the accuracy and reliability of LLM responses to nuanced inquiries in EHRs.
Abstract: Large language models (LLMs) deliver impressive out-of-the-box performance, via prompt engineering, on queries whose underlying data are in the public domain. However, they are less effective when analyzing important datasets that are not publicly available, such as electronic health records (EHRs), where current prompting strategies are either suboptimal or require domain-specific expertise. To overcome these limitations, we propose a $\textit{non-domain-specific}$ prompting strategy—termed $\textbf{Structured Rationale Responses}$ (SRR)—designed to enhance the accuracy and reliability of LLM responses to nuanced inquiries in EHRs compared with expert interpretations. Specifically, SRR guides LLMs to generate responses 1) in a structured format (e.g., JSON), and 2) with rationales, which are sentences excerpted from the query note that the LLM used to support its answer. In 499 full-text EHR notes (474.6±164.3 words) from 125 patients with life-threatening heart rhythm disorders, we asked the LLM whether a patient had an acute event of ventricular arrhythmia, which required it to parse contradictory information on prior events. In an independent hold-out test set of 398 notes (471.8±160.1 words), our SRR achieved a balanced accuracy of 86.6\%±4.0\% without any in-context examples, demonstrating an average performance lift of 30.5\% over standard prompts, 12.2\% over Zero-shot-CoT prompts, and 10.4\% over 5-shot prompts. Notably, for true positives where the LLM correctly identified acute events, 94.4\%±5.2\% had at least one LLM-generated rationale considered clinically relevant by experts. Our code can be found at https://github.com/***.
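The two ingredients of SRR described in the abstract (a structured JSON output format plus verbatim rationales excerpted from the note) can be sketched as a prompt-and-parse pair. This is a minimal illustration only: the field names, question wording, and validation logic below are assumptions for the sketch, not the authors' released implementation, and the model reply is mocked rather than produced by a real LLM call.

```python
import json


def build_srr_prompt(note_text: str) -> str:
    """Build a Structured Rationale Responses (SRR)-style prompt.

    The model is asked to (1) answer in a structured JSON format and
    (2) justify the answer with sentences copied verbatim from the note.
    Field names ("answer", "rationales") are illustrative.
    """
    return (
        "You are reviewing a clinical note. Answer the question below.\n"
        "Question: Does this note describe an ACUTE event of ventricular "
        "arrhythmia (as opposed to only prior/historical events)?\n\n"
        f"Note:\n{note_text}\n\n"
        "Respond ONLY with JSON of the form:\n"
        '{"answer": "yes" | "no", '
        '"rationales": ["<sentence copied verbatim from the note>", ...]}'
    )


def parse_srr_response(raw: str, note_text: str) -> dict:
    """Parse the model's JSON reply and check the rationales are verbatim.

    Raises ValueError if the reply is malformed, which in practice could
    trigger a retry of the query.
    """
    reply = json.loads(raw)
    if reply.get("answer") not in ("yes", "no"):
        raise ValueError("answer must be 'yes' or 'no'")
    if not all(r in note_text for r in reply.get("rationales", [])):
        raise ValueError("rationales must be excerpted verbatim from the note")
    return reply


# Mocked model reply, standing in for an actual LLM call:
note = ("History of VT ablation in 2021. Patient experienced sustained VT "
        "this morning requiring cardioversion.")
mock_reply = ('{"answer": "yes", "rationales": ["Patient experienced '
              'sustained VT this morning requiring cardioversion."]}')
parsed = parse_srr_response(mock_reply, note)
print(parsed["answer"])  # -> yes
```

Requiring rationales to be verbatim excerpts is what makes them auditable: an expert can check each one against the source note, which is how the abstract's clinical-relevance rate for true positives could be assessed.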
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3437