MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering

27 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. Readers: Everyone. License: CC BY 4.0
Keywords: large language model, adversarial machine learning, automatic red teaming
TL;DR: An automatic red-teaming method for testing the robustness of LLMs in medical question answering
Abstract: Large language models (LLMs) have achieved impressive performance on medical question-answering benchmarks. However, high benchmark accuracy does not imply robust performance in real-world clinical settings. Medical question-answering benchmarks rely on assumptions that are convenient for quantifying LLM performance but may not hold in the open world of the clinic. Yet LLMs learn broad knowledge that could help them perform well in practical conditions regardless of the unrealistic assumptions in celebrated benchmarks. We seek to quantify how robust an LLM's performance on medical question-answering benchmarks is to violations of these unrealistic assumptions. Specifically, we present an adversarial method that we call MedFuzz (for medical fuzzing). MedFuzz attempts to modify benchmark questions in ways aimed at confounding the LLM. We demonstrate the approach by targeting unrealistic assumptions about patient characteristics presented in the MedQA benchmark. Successful "attacks" modify a benchmark item in ways that would be unlikely to fool a medical expert but nonetheless "trick" the LLM into changing from a correct to an incorrect answer. Further, we present a non-parametric test for calculating the statistical significance of a successful attack. We show how to calculate "MedFuzzed" performance on a medical QA benchmark, as well as how to find individual cases of statistically significant successful attacks. The methods show promise for providing insights into the ability of an LLM to operate robustly in more realistic settings.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11424