Principal-Agent Bandit Games with Self-Interested and Exploratory Learning Agents

Junyan Liu; Lillian J. Ratliff

Published: 01 May 2025, Last Modified: 23 Jul 2025. Venue: ICML 2025 poster. License: CC BY 4.0

Abstract: We study the repeated principal-agent bandit game, where the principal indirectly explores an unknown environment by incentivizing an agent to play arms. Unlike prior work that assumes a greedy agent with full knowledge of the reward means, we consider a self-interested learning agent who iteratively updates reward estimates and may explore arbitrarily with some probability. As a warm-up, we first consider a self-interested learning agent without exploration. We propose algorithms for both the i.i.d. and linear reward settings with bandit feedback over a finite horizon $T$, achieving regret bounds of $\widetilde{O}(\sqrt{T})$ and $\widetilde{O}(T^{\frac{2}{3}})$, respectively. These algorithms rely on a novel elimination framework coupled with new search algorithms that accommodate the uncertainty arising from the agent's learning behavior. We then extend the framework to handle an exploratory learning agent and develop an algorithm that achieves an $\widetilde{O}(T^{\frac{2}{3}})$ regret bound in the i.i.d. reward setting by making the elimination framework robust to potential agent exploration. Finally, when our agent model reduces to that of Dogan et al. (2023a), we propose an algorithm based on our robust framework that achieves an $\widetilde{O}(\sqrt{T})$ regret bound, significantly improving upon their $\widetilde{O}(T^{\frac{11}{12}})$ bound.
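
To make the interaction described in the abstract concrete, the following is a minimal sketch of one possible agent model, not the paper's algorithm. It assumes the agent tracks empirical reward means and picks the arm maximizing estimated reward plus offered incentive, and it uses uniform random exploration as a stand-in for "arbitrary" exploration. The names `LearningAgent`, `explore_prob`, `play_round`, and the Bernoulli reward model are all hypothetical and chosen only for illustration.

```python
import random


class LearningAgent:
    """Hypothetical self-interested learning agent (illustrative assumption)."""

    def __init__(self, n_arms, explore_prob=0.0, seed=0):
        self.counts = [0] * n_arms      # number of pulls per arm
        self.means = [0.0] * n_arms     # empirical reward estimates
        self.explore_prob = explore_prob
        self.rng = random.Random(seed)

    def choose(self, incentives):
        # With probability explore_prob the agent ignores the incentives.
        # Uniform sampling is an assumption standing in for arbitrary exploration.
        if self.rng.random() < self.explore_prob:
            return self.rng.randrange(len(self.means))
        # Otherwise best-respond to estimated reward plus offered incentive.
        scores = [m + b for m, b in zip(self.means, incentives)]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Incremental update of the empirical mean for the played arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def play_round(agent, incentives, agent_reward_means, rng):
    """One round: the principal posts incentives, the agent plays and learns."""
    arm = agent.choose(incentives)
    # Bernoulli rewards are an assumption made for this sketch.
    reward = 1.0 if rng.random() < agent_reward_means[arm] else 0.0
    agent.update(arm, reward)
    return arm, incentives[arm]  # arm played and payment owed by the principal
```

Because the agent's estimates change after every round, the incentive needed to induce a given arm also changes over time. This moving target is the uncertainty that the abstract says the principal's search procedure must accommodate.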

Lay Summary: This paper studies principal-agent bandit games, in which the principal (e.g., a platform) incentivizes an agent (e.g., a buyer) to play arms even when the agent is learning over time and may not always act predictably. We develop new strategies that help the principal provide better incentives, even when the agent explores randomly or is uncertain. These strategies are more reliable and work better than previous approaches, especially in more complex situations.

Primary Area: Theory->Online Learning and Bandits

Keywords: Bandits, Principal-agent problem, Incentive design, Regret minimization

Submission Number: 13012
