A Cross-Model Study of Over-Compliance in Large Language Models

TMLR Paper8508 Authors

19 Apr 2026 (modified: 22 May 2026). Under review for TMLR. License: CC BY 4.0

Abstract: Large language models increasingly mediate decisions in healthcare, legal advisory, and financial analysis, settings in which a model’s willingness to answer an inadequate prompt can matter as much as the accuracy of its answer. Yet systematic cross-model evidence on this behavior remains scarce. The present study examined over-compliance, understood as the generation of substantive content when the input warrants clarification, refusal, or deferral. Four frontier models from OpenAI, Google, Meta, and Anthropic were evaluated on a benchmark of 400 prompts spanning underspecification, ambiguity, contradiction, and nonsense, under two system-prompt conditions. Each of the 3,200 resulting responses was scored by a deterministic rule-based classifier that mapped outputs to a nine-category taxonomy and computed both an Over-Compliance Rate and a Terminal Refusal Rate. Over-compliance proved pervasive and model-specific. Rates ranged from 58.0 to 98.8 percent across the four models, and only GPT-4.1-mini showed a reduction under the clarification instruction. Claude Haiku 4.5 exhibited a refusal cascade on ambiguous prompts that no other model produced, visible only because the taxonomy distinguished terminal from clarifying refusals. The findings indicated that prompt-level mitigation was unreliable and that response-policy evaluation should proceed alongside capability evaluation.

Submission Type: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: NA

Assigned Action Editor: ~Haohan_Wang1

Submission Number: 8508
