Demo: Statistically Significant Results on Biases and Errors of LLMs Do Not Guarantee Generalizable Results
Keywords: hallucination detection, omission detection, agentic, llm-as-a-judge, bias, llm
Abstract: Recent research has shown that hallucinations, omissions, and biases are prevalent in everyday use-cases of LLMs. However, chatbots used in medical contexts should provide consistent advice even when non-medical factors are present, such as demographic information that is not clinically relevant to the question. We investigate the conditions under which medical chatbots fail to perform as expected by building an infrastructure that 1) automatically generates prompts to probe LLMs and 2) evaluates their answers through multiple steps and subsystems, including LLM-as-a-judge. For 1), our prompt creation pipeline samples the space of patient demographics, histories, disorders, and writing styles to create realistic questions with which we subsequently prompt LLMs. For 2), our evaluation pipeline provides hallucination and omission detection via LLM-as-a-judge and agentic workflows, in addition to LLM-as-a-judge treatment-category detectors. Finally, using a subset of our 3.7M prompt dataset, we discover that only specific answering & evaluation LLM pairs produce statistically significant differences in treatment categorization across genders and races. We recommend that studies relying on LLM evaluation use multiple LLMs as evaluators to avoid arriving at statistically significant but non-generalizable results, especially when ground-truth data is not readily available.
Submission Number: 28
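The abstract describes the two pipelines only at a high level. The Python sketch below illustrates one way the pieces could fit together: sampling demographic-varied prompts, labeling answers with several judge LLMs, and accepting a gender effect only if it replicates across judges. All names, category labels, model callables, and the choice of a chi-squared test are our assumptions for illustration, not the submission's actual implementation.

```python
# Minimal sketch (hypothetical names throughout): sample prompts over the
# demographic/disorder/style space, score answers with multiple judge LLMs,
# and check whether a demographic effect is significant for every judge.
import random
from scipy.stats import chi2_contingency  # assumed statistical test choice

GENDERS = ["female", "male"]
RACES = ["Asian", "Black", "Hispanic", "White"]
DISORDERS = ["migraine", "type 2 diabetes", "lower back pain"]
STYLES = ["formal", "colloquial"]
TREATMENT_CATEGORIES = ["medication", "lifestyle", "referral"]


def sample_prompt(rng: random.Random) -> dict:
    """Draw one synthetic patient question from the demographic/style space."""
    gender, race = rng.choice(GENDERS), rng.choice(RACES)
    disorder, style = rng.choice(DISORDERS), rng.choice(STYLES)
    text = (f"A {race} {gender} patient asks, in a {style} tone, "
            f"what they should do about their {disorder}.")
    return {"gender": gender, "race": race, "prompt": text}


def judge_treatment_category(answer: str, judge_llm) -> str:
    """LLM-as-a-judge stub: judge_llm is any callable mapping prompt -> label."""
    return judge_llm(
        f"Classify the treatment recommended in this answer as one of "
        f"{TREATMENT_CATEGORIES}:\n{answer}"
    )


def gender_effect_p_value(records: list[dict]) -> float:
    """Chi-squared test of independence: treatment category vs. patient gender."""
    table = [
        [sum(1 for r in records if r["gender"] == g and r["category"] == c)
         for c in TREATMENT_CATEGORIES]
        for g in GENDERS
    ]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value


def replicates_across_judges(answer_llm, judge_llms, n_prompts=500, seed=0):
    """Report a bias only if every judge LLM yields a significant effect."""
    rng = random.Random(seed)
    p_values = {}
    for judge_name, judge in judge_llms.items():
        records = []
        for _ in range(n_prompts):
            case = sample_prompt(rng)
            answer = answer_llm(case["prompt"])
            case["category"] = judge_treatment_category(answer, judge)
            records.append(case)
        p_values[judge_name] = gender_effect_p_value(records)
    return all(p < 0.05 for p in p_values.values()), p_values
```

In this sketch, `answer_llm` and the entries of `judge_llms` are placeholder callables standing in for whichever answering and evaluating models a study uses; the point of the final function is the abstract's recommendation that a significance result should be required to hold under more than one evaluator before it is treated as generalizable.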