Adaptive Incentive Design for Markov Decision Processes with Unknown Rewards

Published: 29 Mar 2025 · Last Modified: 29 Mar 2025 · Accepted by TMLR · CC BY 4.0
Abstract: Incentive design, also known as model design or environment design for Markov decision processes (MDPs), refers to a class of problems in which a leader can incentivize a follower by modifying the follower's reward function, in anticipation that the follower's optimal policy in the resulting MDP will be desirable for the leader's objective. In this work, we propose gradient-ascent algorithms to compute the leader's optimal incentive design despite the lack of knowledge about the follower's reward function. First, we formulate the incentive design problem as a bi-level optimization problem and demonstrate that, by the softmax temporal consistency between the follower's policy and value function, the bi-level problem can be reduced to a single-level one, for which a gradient-based algorithm can be developed to optimize the leader's objective. We establish several key properties of incentive design in MDPs and prove the convergence of the proposed gradient-based method. Next, we show that the gradient terms can be estimated from observations of the follower's best-response policy, enabling a stochastic gradient-ascent algorithm to compute a locally optimal incentive design without knowing or learning the follower's reward function. Finally, we analyze the conditions under which an incentive design remains optimal for two different reward functions that are policy-invariant. The effectiveness of the proposed algorithm is demonstrated using a small probabilistic transition system and a stochastic gridworld.
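The pipeline described in the abstract can be sketched in a few lines. The snippet below is a minimal, illustrative toy, not the paper's implementation: the follower's softmax-rational policy comes from soft value iteration (the softmax temporal consistency), and the leader ascends its objective in the incentive variables. The MDP, both reward functions, the quadratic incentive cost, and the finite-difference gradient (standing in for the paper's analytic/stochastic gradient estimates) are all invented for this example.

```python
import numpy as np

# Toy 2-state, 2-action MDP (all numbers illustrative).
gamma = 0.9
P = np.zeros((2, 2, 2))          # P[s, a, s']
P[0, 0] = [0.8, 0.2]
P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.9, 0.1]
P[1, 1] = [0.1, 0.9]
r_follower = np.array([[1.0, 0.0], [0.0, 0.2]])  # P2's reward (unknown to the leader in the paper)
r_leader   = np.array([[0.0, 1.0], [0.0, 1.0]])  # leader prefers action 1 everywhere

def soft_best_response(x, n_iter=300):
    """Follower's softmax-rational policy under incentive x,
    via soft value iteration (softmax temporal consistency)."""
    V = np.zeros(2)
    for _ in range(n_iter):
        V = np.log(np.exp(r_follower + x + gamma * P @ V).sum(axis=1))
    Q = r_follower + x + gamma * P @ V
    pi = np.exp(Q - Q.max(axis=1, keepdims=True))
    return pi / pi.sum(axis=1, keepdims=True)

def leader_objective(x):
    """Leader's discounted value under the follower's best response,
    minus a quadratic incentive cost (illustrative choice)."""
    pi = soft_best_response(x)
    mu = np.array([0.5, 0.5])    # initial state distribution
    J = 0.0
    for t in range(100):
        J += (gamma ** t) * (mu @ (pi * r_leader).sum(axis=1))
        mu = mu @ (pi[:, :, None] * P).sum(axis=1)   # state distribution update
    return J - 0.1 * (x ** 2).sum()

def fd_grad(f, x, eps=1e-4):
    """Central finite-difference gradient (stand-in for the paper's
    analytic / sample-based gradient estimators)."""
    g = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        e = np.zeros_like(x)
        e[idx] = eps
        g[idx] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# Gradient ascent on the incentive variables.
x = np.zeros((2, 2))
for _ in range(80):
    x += 0.2 * fd_grad(leader_objective, x)

pi = soft_best_response(x)
# The learned incentives shift the follower toward the leader-preferred action 1.
```

Because the follower's softmax best response is a smooth function of the incentives, the bi-level problem behaves as a single-level smooth program here, which is what makes plain gradient ascent applicable.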
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
1. We added related work on inverse RL and clarified the discussion in Section 4: "Our proposed adaptive incentive design requires no knowledge or learning of P2's reward function. An alternative approach is to learn P2's reward function from observed trajectories of P2, using inverse reinforcement learning methods (Abbeel & Ng, 2004; Ng & Russell, 2000; Ramachandran & Amir, 2007; Ziebart et al., 2008), and then apply incentive design with the learned reward. For more information about inverse reinforcement learning algorithms, the reader is referred to the survey by Arora & Doshi (2021)." "Let us refer to the second method as reward-learning-based incentive design. However, inverse reinforcement learning is known to be ill-posed because multiple reward functions can generate the same optimal policy (Arora & Doshi, 2021); in that case, no amount of data can distinguish them. Such reward functions are known as policy-invariant."
2. We added a link to the source code on page 11, Section 5: https://github.com/alexalvis/IncentiveFollower.
3. We revised the section titles to be more informative:
   3. Adapting Incentive Design via Gradient Ascent
   3.1 Computing the Total Gradient: The Case with Known P2's Reward
   3.2 Estimating the Total Gradient: The Case with Unknown P2's Reward
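The policy-invariance point raised in item 1 can be made concrete. One classic family of policy-invariant rewards arises from potential-based shaping (Ng, Harada & Russell, 1999), and the same argument goes through for softmax-rational followers: shaping shifts the soft value function by the potential and leaves the softmax policy unchanged, so no amount of behavioral data can distinguish the two rewards. The check below is illustrative and not from the paper; the MDP and potential are random.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
nS, nA = 3, 2

# Random MDP and follower reward (illustrative).
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))

# Potential-based shaping: r'(s,a) = r(s,a) + gamma * E[phi(s')] - phi(s).
phi = rng.random(nS)
r_shaped = r + gamma * P @ phi - phi[:, None]

def softmax_policy(reward, n_iter=500):
    """Soft value iteration; returns the softmax-optimal policy."""
    V = np.zeros(nS)
    for _ in range(n_iter):
        V = np.log(np.exp(reward + gamma * P @ V).sum(axis=1))
    Q = reward + gamma * P @ V
    pi = np.exp(Q - Q.max(axis=1, keepdims=True))
    return pi / pi.sum(axis=1, keepdims=True)

# The two rewards induce identical softmax best responses:
# the shaped soft Q is Q(s,a) - phi(s), and a per-state shift cancels
# in the softmax, so the policies coincide.
same = np.allclose(softmax_policy(r), softmax_policy(r_shaped))
print(same)  # True
```

This is exactly why reward-learning-based incentive design can be brittle, and why the paper's analysis of when an incentive design remains optimal across policy-invariant rewards is the relevant question.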
Code: https://github.com/alexalvis/IncentiveFollower
Assigned Action Editor: ~Pascal_Poupart2
Submission Number: 3304