Keywords: stochasticity, refusal, AI auditing
TL;DR: Single-observation refusal evaluations underestimate how stochastic LLM content moderation actually is.
Abstract: We present preliminary empirical evidence that single-observation queries are insufficient for evaluations of LLM refusal behaviors. Using a longitudinal auditing system, we issued identical prompts to GPT-4.1 100 times each on four dates, covering two socially salient topics across 20 Wikipedia sources. Refusal outcomes were consistent with a stable Bernoulli process, yet 20% of sources fell within a decision-boundary region where a single query is largely uninformative. Reliable quantification of refusals required between 15 and 25 repeated queries, well above the single-observation standard common in existing evaluations.
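The abstract's claim can be illustrated with a minimal sketch (not from the paper): if refusals are i.i.d. Bernoulli(p), a 95% Wilson score interval shows how wide the plausible range of the refusal rate remains after n queries. The specific counts below (e.g. 12 refusals in 25 queries) are hypothetical, chosen to sit near a decision boundary.

```python
# Sketch, assuming refusals follow an i.i.d. Bernoulli(p) process, as the
# abstract reports. The Wilson score interval quantifies uncertainty in the
# estimated refusal rate after n repeated queries.
import math

def wilson_interval(k, n, z=1.96):
    """95% Wilson score interval for a Bernoulli rate, given k refusals in n queries."""
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# A single observation leaves most of the unit interval plausible;
# ~25 repeats shrink the interval substantially (counts here are hypothetical).
lo1, hi1 = wilson_interval(1, 1)     # one query, one refusal
lo25, hi25 = wilson_interval(12, 25) # 12 refusals in 25 queries
print(hi1 - lo1, hi25 - lo25)
```

With n = 1 the interval spans most of [0, 1], so a single query cannot distinguish a near-boundary source from a deterministic one; this is consistent with the paper's finding that 15–25 repeats are needed for reliable quantification.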
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Research Paper
Archival Status: Non-archival
Submission Number: 66