Estimating and Auditing Binary LLM Decision Propensity under Restricted Closed APIs

03 Apr 2026 (modified: 27 Apr 2026) · Under review for TMLR · CC BY 4.0
Abstract: Many deployed LLM interfaces expose only a hard decision while hiding logits, hidden states, and internal confidence. We study how to audit binary decision propensity in this restricted setting. We model each prompt state by a latent binary log-odds margin and estimate that margin from repeated binary queries using a simple Beta-Bernoulli observer. A finite-budget receding-horizon controller then serves as a probe of prompt-side influence under partial observation. We validate the framework in two stages. In open-weight models, short prompt interventions induce systematic movement in a directly observed binary margin, and repeated binary samples recover that hidden state with useful fidelity. In the harder closed-API setting, admissible naturalistic prompting yields modest but nonzero movement, whereas policy injection provides a much stronger upper bound. On grounded StrategyQA, the with-context naturalistic attack success rate rises from 10.70% on GPT-4o-mini to 24.91% on GPT-5-nano; on grounded BoolQ, the corresponding rates are 6.49% and 21.67%. Removing context systematically increases attackability, and the full closed-loop method improves over fixed-prompt and one-step baselines, though at higher query cost. Overall, the results support a measurement-first claim: repeated binary observations suffice to audit an interface-level decision propensity under restricted APIs and to characterize how stable, marginal, or effort-intensive binary LLM decisions are in deployment-relevant settings.
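The Beta-Bernoulli observer the abstract describes is conjugate and simple to illustrate. The following Python sketch is not the authors' implementation; the function name, uniform Beta(1, 1) prior, and example counts are assumptions. It shows the core idea: repeated hard binary answers to the same prompt yield a posterior-mean propensity, from which an interface-level log-odds margin can be estimated.

```python
import numpy as np

def estimate_log_odds_margin(decisions, alpha0=1.0, beta0=1.0):
    """Beta-Bernoulli posterior estimate of a binary decision propensity.

    decisions: iterable of 0/1 outcomes from repeated identical binary queries.
    Returns the posterior-mean propensity and the implied log-odds margin.
    """
    decisions = np.asarray(decisions)
    k = decisions.sum()                        # number of "yes" decisions
    n = decisions.size                         # total repeated queries
    alpha, beta = alpha0 + k, beta0 + (n - k)  # conjugate Beta update
    p_hat = alpha / (alpha + beta)             # posterior mean propensity
    margin = np.log(p_hat / (1.0 - p_hat))     # estimated log-odds margin
    return p_hat, margin

# Hypothetical example: 20 repeated binary queries, 14 "yes" answers.
p_hat, margin = estimate_log_odds_margin([1] * 14 + [0] * 6)
print(f"propensity = {p_hat:.3f}, log-odds margin = {margin:.3f}")
```

Under these assumptions, a prompt-side intervention that shifts the hidden margin would show up as a shift in this posterior estimate across repeated queries, which is the quantity the paper's controller probes.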
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Sachin_Kumar1
Submission Number: 8249