Keywords: Reinforcement Learning, Bilevel Optimization, Dynamic Mechanism Design, Principal-Agent Problems, Contextual MDPs, Environment Design, Model Design
Abstract: Recent research has focused on providing the right incentives to learning agents in dynamic settings. Given the high-stakes applications, the design of reliable and trustworthy algorithms for these problems is paramount. In this work, we define the Bilevel Optimization on Contextual Markov Decision Processes (BO-CMDP) framework, which captures a wide range of problems such as dynamic mechanism design and principal-agent reward shaping. BO-CMDP can be viewed as a Stackelberg game in which the leader, together with a random context beyond the leader’s control, configures an MDP while (potentially many) followers optimize their strategies given that setting. To solve it, we propose Hyper Policy Gradient Descent (HPGD) and prove its non-asymptotic convergence. HPGD relies on only weak assumptions about the information available to the leader: it makes no assumptions about competition or cooperation between the agents and allows the followers to use any training procedure, to which the leader is agnostic. This setting aligns with the information asymmetry present in most economic applications.
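The bilevel structure described in the abstract can be illustrated with a minimal, hypothetical toy sketch (not the paper's algorithm): a leader perturbs a reward table under a random context, a black-box follower best-responds, and the leader updates its parameters from observed returns only. All names and the simplified zeroth-order finite-difference update below are illustrative assumptions standing in for the HPGD hypergradient estimator.

```python
import numpy as np

# Hypothetical toy sketch of a BO-CMDP-style leader-follower loop.
# The leader's parameters theta shape the follower's reward; the follower
# trains with its own (black-box) procedure; the leader observes only returns.
# NOTE: the finite-difference step is a stand-in, not the HPGD estimator.

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3


def follower_best_response(reward):
    """Follower's training procedure (opaque to the leader); here: greedy per state."""
    return np.argmax(reward, axis=1)


def leader_return(theta, context):
    """Leader's objective: discourage action 0 while paying a shaping cost."""
    base_reward = np.zeros((n_states, n_actions)) + context  # context shifts rewards
    shaped_reward = base_reward + theta                      # leader's reward shaping
    policy = follower_best_response(shaped_reward)           # follower best-responds
    utility = np.mean(policy != 0)                           # states avoiding action 0
    cost = 0.01 * np.sum(theta ** 2)                         # cost of shaping
    return utility - cost


theta = np.zeros((n_states, n_actions))
step, sigma = 0.5, 0.1
for t in range(100):
    context = rng.normal(size=(1, n_actions))  # random context outside leader's control
    u = rng.normal(size=theta.shape)           # zeroth-order perturbation direction
    grad_est = (leader_return(theta + sigma * u, context)
                - leader_return(theta - sigma * u, context)) / (2 * sigma) * u
    theta += step * grad_est                   # leader's (hyper)gradient step

print("final leader return on a fresh context:",
      leader_return(theta, rng.normal(size=(1, n_actions))))
```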
Submission Number: 19