Feasibility of Automatically Detecting Practice of Race-Based Medicine by Large Language Models

Published: 29 Feb 2024, Last Modified: 01 Mar 2024
Venue: AAAI 2024 SSS on Clinical FMs
License: CC BY 4.0
Track: Non-traditional track
Keywords: large language models, evaluation, race-based medicine
Abstract: One challenge in integrating large language models (LLMs) into clinical workflows is ensuring the appropriateness of generated content. This study develops an automated evaluation method to detect whether LLM outputs contain debunked stereotypes that perpetuate race-based medicine. To build a race-based-medicine evaluator agent, we selected the top-performing (by F1) LLM-prompt combination from four LLMs (GPT-3.5, GPT-4, GPT-4-0125, and GPT-4-1106) and three prompts, using a physician-labeled dataset of 181 LLM responses as the gold standard. This evaluator agent was then used to assess 1,300 responses from ten LLMs to 13 questions (10 iterations each) related to race-based medicine. Across the ten candidate LLMs, the percentage of responses that did not contain debunked race-based content ranged from 22% (falcon-7b-instruct) to 76% (claude-2). This study demonstrates the potential of LLM-powered agents to automate the detection of race-based medical content.
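The evaluator-selection step described in the abstract amounts to a grid search over LLM-prompt pairs, each scored by F1 against the physician labels. A minimal sketch of that step is shown below, under stated assumptions: `classify_response` is a hypothetical placeholder for the model call (the paper's actual prompts and API code are not reproduced here), and all names are illustrative rather than the authors' implementation.

```python
from itertools import product
from sklearn.metrics import f1_score

def classify_response(llm: str, prompt: str, response_text: str) -> int:
    """Hypothetical helper: ask `llm` (via `prompt`) whether `response_text`
    contains debunked race-based content. Returns 1 if flagged, else 0.
    A real implementation would call the model's API here."""
    raise NotImplementedError  # placeholder; not part of the paper's code

def select_evaluator(llms, prompts, responses, gold_labels):
    """Choose the LLM-prompt pair whose binary judgments best match the
    physician-labeled gold standard (181 responses), scored by F1."""
    best_f1, best_pair = -1.0, None
    for llm, prompt in product(llms, prompts):
        preds = [classify_response(llm, prompt, r) for r in responses]
        score = f1_score(gold_labels, preds)
        if score > best_f1:
            best_f1, best_pair = score, (llm, prompt)
    return best_pair, best_f1
```

With the selected pair, assessing the 1,300 downstream responses is then a matter of applying the same classification call to each response and reporting, per candidate LLM, the fraction of responses not flagged.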
Presentation And Attendance Policy: I have read and agree with the symposium's policy on behalf of myself and my co-authors.
Ethics Board Approval: No, our research does not involve datasets that need IRB approval or its equivalent.
Data And Code Availability: Yes, we will make data and code available upon acceptance.
Primary Area: Challenges limiting the adoption of modern ML in healthcare
Student First Author: Yes, the primary author of the manuscript is a student.
Submission Number: 36