REAL-TIME RISK EVALUATION FOR LLM DECISION-MAKING VIA A REGRET BOUND

ICLR 2026 Conference Submission 25114 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM, game theory
Abstract: We study real-time risk certification for large language model (LLM) agents with black-box action selection rules, aiming to upper-bound the per-round regret. We fix a reference policy map $f$ (e.g., a softmax with temperature $T$, whose TV-Lipschitz constant is $C$, though any TV-Lipschitz mapping can be used), which takes a predicted opponent action distribution as input and returns a reference policy. We form the plug-in reference policy $s\_{\hat{\mu}\_t}=f(\hat{\mu}\_t)$ from the model's predicted opponent distribution $\hat{\mu}\_t$. Our certificate is $r\_t \le L(E\_{pred}+E\_{pol}+E\_{mis})$, where $E\_{pred}:=\frac{C}{2}||\mu\_t-\hat\mu\_t||\_1$ (prediction error), $E\_{pol}:=\frac{1}{2}||\pi\_t^{\ast}-s\_{\mu\_t}||\_1$ (policy error), $E\_{mis}:=\frac{1}{2}||\pi\_t-s\_{\hat\mu\_t}||\_1$ (policy mismatch), $L$ is the Lipschitz constant of the instantaneous regret with respect to total variation induced by $Q$ (hence domain-dependent), $C$ is the TV-Lipschitz constant of $f$, $\pi\^*\_t$ denotes the one-hot best response to $\mu_t$ under $Q\_t$ (ties broken arbitrarily), and $\pi_t$ is the agent's policy. We assume access at time $t$ to the realized opponent distribution $\mu_t$ and the per-round payoffs $Q_t$ (and hence $\pi^{\ast}$), so the certificate is fully computable in real time. In this bound, prediction error measures the accuracy of the model's opponent modeling (belief calibration). In contrast, policy error, together with the policy mismatch $\frac{1}{2}\|\pi_t-s_{\hat{\mu}_t}\|_1$, quantifies the precision of the decision side given $\hat{\mu}\_t$. Therefore, this bound enables us to localize the risk of the decision to either prediction or action selection. We applied the certificate to separate, in real time and for black-box policy agents, whether decision risk stems from prediction or from action selection. In the Ultimatum and $2\times2$ general-sum games, the dominant component is opponent- and game-dependent. This separation does not yield a characterization common to all games and opponents, but under the same game and opponent strategy, it reveals consistent differences between models.
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 25114