Statistical Hypothesis Testing for Auditing Robustness in Language Models

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
TL;DR: We develop a statistical hypothesis testing method to quantify the impact of a perturbation in the input prompt on the outputs of language models.
Abstract: Consider the problem of testing whether the outputs of a large language model (LLM) system change under an arbitrary intervention, such as an input perturbation or a change of model variant. We cannot simply compare two LLM outputs, since they might differ due to the stochastic nature of the system, nor can we compare the entire output distributions, which is computationally intractable. While methods for analyzing text-based outputs exist, they focus on fundamentally different problems, such as measuring bias or fairness. To this end, we introduce distribution-based perturbation analysis, a framework that reformulates LLM perturbation analysis as a frequentist hypothesis testing problem. We construct empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling, enabling tractable inference without restrictive distributional assumptions. The framework (i) is model-agnostic; (ii) supports the evaluation of arbitrary input perturbations on any black-box LLM; (iii) yields interpretable p-values; (iv) supports multiple perturbations via controlled error rates; and (v) provides scalar effect sizes. We demonstrate the usefulness of the framework across multiple case studies, showing how we can quantify response changes, measure true/false positive rates, and evaluate alignment with reference models. Above all, we see this as a reliable frequentist hypothesis testing framework for LLM auditing.
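For intuition, the testing recipe described in the abstract can be approximated by a simple Monte Carlo permutation test over sentence embeddings. The sketch below is illustrative only: the `sampler` callable (standing in for repeated black-box LLM queries), the choice of embedding model, and the mean-embedding-distance statistic are assumptions of this example, not the exact procedure of the paper or the linked dbpa repository.

```python
# Minimal sketch: Monte Carlo permutation test over sentence embeddings.
# Illustrative only; the sampler, encoder, and test statistic are assumed choices.
from typing import Callable, List

import numpy as np
from sentence_transformers import SentenceTransformer


def mean_embedding_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Test statistic: Euclidean distance between the two group-mean embeddings."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))


def perturbation_test(
    sampler: Callable[[str, int], List[str]],  # e.g. n temperature>0 calls to a black-box LLM
    prompt: str,
    perturbed_prompt: str,
    n: int = 50,
    n_perm: int = 1000,
    seed: int = 0,
) -> float:
    """Permutation p-value for H0: the perturbation does not change the output distribution."""
    rng = np.random.default_rng(seed)
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would do here

    # Monte Carlo sampling of outputs under the original and perturbed prompts,
    # mapped into a low-dimensional semantic embedding space.
    X = np.asarray(encoder.encode(sampler(prompt, n)))
    Y = np.asarray(encoder.encode(sampler(perturbed_prompt, n)))
    observed = mean_embedding_distance(X, Y)

    # Empirical null: shuffle the group labels and recompute the statistic.
    pooled = np.vstack([X, Y])
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        if mean_embedding_distance(pooled[perm[:n]], pooled[perm[n:]]) >= observed:
            exceed += 1

    # Permutation p-value with the standard +1 correction.
    return (exceed + 1) / (n_perm + 1)
```

A small p-value suggests the perturbation shifts the output distribution by more than resampling noise alone would explain; when auditing many perturbations at once, the resulting p-values would still require a multiple-testing correction, in the spirit of the controlled error rates mentioned in the abstract.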
Lay Summary: Suppose you ask a large language model (LLM) for a treatment recommendation based on the information you provide. Then, because you are feeling adventurous, you change some information in your prompt (for example, your gender) and ask the LLM again. To your surprise, you receive a different treatment recommendation. One question you could ask is the following: is the different response *a result of* the information you changed? The short answer is that we cannot know by asking the question only once, because language models respond differently simply by chance, even to the same question. What we really care about, therefore, is whether the kinds of responses the LLM provides differ from what they would otherwise have been. Furthermore, we would like to quantify (assign a number to) how much the responses have changed and whether these changes are statistically significant. This is exactly what this paper does: it establishes a way to quantify how much responses have changed and to test whether those changes are statistically meaningful. This has many useful applications. For example, it allows us to evaluate whether models change their responses to meaningful information changes ("true positives") or whether they respond differently when they in fact should not ("false positives"). As a motivating example, we would not trust a model that gives different treatment recommendations because a typo was added to your name, or because your height was changed when it is irrelevant to the diagnosis. Conversely, we can now identify whether a model fails to change its recommendations even when given new, important information that should have changed the answer. The paper develops these ideas theoretically and illustrates them in multiple case studies.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/vanderschaarlab/dbpa
Primary Area: General Machine Learning->Everything Else
Keywords: language models, safety, interpretability, reliability
Submission Number: 16143