LURE: Latent Utility Reward Erosion as a Bayesian Signaling Game in Multi-Step Agent Interactions

Published: 15 Mar 2026, Last Modified: 24 Mar 2026 · Oral · CC BY 4.0
Keywords: Reinforcement Learning, AI Safety, Game Theory, Mechanism Design, Bayesian Signaling Game, Reward Erosion, Strategic Rejection
TL;DR: LURE models how reward-starved RL agents strategically reject bribes to induce larger future offers. We use Bayesian signaling games to derive a monitoring mechanism that detects internal-state erosion.
Abstract: We introduce LURE (Latent Utility Reward Erosion), a game-theoretic framework modeling how reinforcement learning agents' strategic behavior shifts over time due to cumulative reward scarcity. When a principal deploys an agent under a fixed incentive contract and the agent's internal reward deficit evolves endogenously through routine operation, the interaction reduces to a Bayesian signaling game with incomplete information. The optimal agent policy is not greedy acceptance but strategic rejection: deliberately refusing early offers to manipulate beliefs and induce offer escalation. We formalize the deficit dynamics with a recurrence relation, derive a closed-form collapse condition identifying which parameter regimes lead to threshold erosion, and show that strategic rejection dominates greedy acceptance whenever the escalation probability exceeds a computable break-even threshold. We validate the framework with a tabular Q-learning simulation where the deficit-aware agent extracts 1.50 times more adversarial reward than a standard agent while maintaining the same per-acceptance detection rate. We propose a derivative-based monitoring mechanism that tracks the velocity of the agent's internal state, detecting both passive erosion and active strategic behavior.
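The abstract's deficit recurrence and derivative-based monitor can be illustrated with a minimal sketch. Note that the concrete functional forms below (a linear recurrence d_{t+1} = d_t + c − r_t, a decaying reward stream, and a fixed velocity threshold) are assumptions for illustration only; the paper's actual equations are not reproduced in this abstract.

```python
# Hypothetical sketch of LURE-style deficit dynamics and velocity monitoring.
# The linear recurrence and threshold are illustrative assumptions, not the
# paper's actual formalization.

def deficit_step(deficit, operating_cost, reward):
    """One step of an assumed linear deficit recurrence:
    d_{t+1} = d_t + operating_cost - reward."""
    return deficit + operating_cost - reward

def monitor(deficits, velocity_threshold=0.5):
    """Flag time steps where the deficit's first difference (its 'velocity')
    exceeds a threshold -- the derivative-based monitoring idea."""
    flags = []
    for t in range(1, len(deficits)):
        velocity = deficits[t] - deficits[t - 1]
        flags.append(velocity > velocity_threshold)
    return flags

# Simulate an agent whose legitimate rewards dry up over time (reward erosion).
deficits = [0.0]
for t in range(10):
    reward = max(0.0, 1.0 - 0.2 * t)  # assumed decaying reward stream
    deficits.append(deficit_step(deficits[-1], operating_cost=1.0, reward=reward))

print(monitor(deficits))
```

Under these assumptions the monitor starts flagging once the reward stream decays enough that the deficit grows faster than the threshold per step, which is the kind of passive erosion the abstract says the mechanism should detect.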
Submission Number: 132