CRAFT-MD: A Conversational Evaluation Framework for Comprehensive Assessment of Clinical LLMs

Shreya Johri; Jaehwan Jeong; Benjamin A. Tran; Daniel I Schlessinger; Shannon Wongvibulsin; Zhuo Ran Cai; Roxana Daneshjou; Pranav Rajpurkar

Published: 29 Feb 2024, Last Modified: 02 May 2024. Venue: AAAI 2024 SSS on Clinical FMs. License: CC BY 4.0

Track: Non-traditional track

Keywords: clinical LLMs, conversational AI

TL;DR: We introduce an evaluation framework (CRAFT-MD) for clinical LLMs that assesses the ability to lead clinical conversations, gather relevant medical history, synthesize information presented over multiple dialogues and provide an accurate diagnosis.

Abstract: The integration of Large Language Models (LLMs) into clinical diagnostics has the potential to transform patient-doctor interactions. However, the readiness of these models for real-world clinical application remains inadequately tested. This paper introduces the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD), a novel approach for evaluating clinical LLMs. Unlike traditional methods that rely on structured medical exams, CRAFT-MD focuses on natural dialogues, using simulated AI agents to interact with LLMs in a controlled, ethical environment. We applied CRAFT-MD to assess the diagnostic capabilities of GPT-4 and GPT-3.5 in the context of skin diseases. Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history taking, and diagnostic accuracy, emphasising the need to evaluate clinical LLMs beyond static exam questions. The introduction of CRAFT-MD marks a significant advancement in LLM testing, aiming to ensure that these models augment medical practice effectively and ethically.

Presentation And Attendance Policy: I have read and agree with the symposium's policy on behalf of myself and my co-authors.

Ethics Board Approval: No, our research does not involve datasets that need IRB approval or its equivalent.

Data And Code Availability: Yes, we will make data and code available upon acceptance.

Primary Area: Clinical foundation models

Student First Author: Yes, the primary author of the manuscript is a student.

Submission Number: 33
