Reinforcement Learning with Quasi-Hyperbolic Discounting

Published: 17 Jun 2024, Last Modified: 17 Jun 2024, FoRLaC Poster, CC BY 4.0
Abstract: Reinforcement learning has traditionally been studied under exponential discounting or the average-reward setup, mainly due to their mathematical tractability. However, such frameworks fall short of accurately capturing human behavior, which often exhibits a bias towards immediate gratification. Quasi-Hyperbolic (QH) discounting is a simple alternative for modeling this bias. Unlike in traditional discounting, though, the optimal QH policy starting from some time $t_1$ can differ from the one starting from $t_2$. Hence, the future self of an agent, if it is naive or impatient, can deviate from the policy that is optimal at the start, leading to sub-optimal overall returns. To prevent this behavior, an alternative is to work with a policy anchored in a Markov Perfect Equilibrium (MPE). In this work, we propose the first model-free algorithm for finding an MPE. Using a brief two-timescale analysis, we provide evidence that our algorithm converges to invariant sets of a suitable Differential Inclusion (DI). We then formally show that any MPE is an invariant set of our identified DI. Finally, we validate our findings numerically on the standard inventory system with stochastic demands.
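For context, quasi-hyperbolic (beta-delta) discounting weights the immediate reward fully and applies an extra factor $\beta$ to all future rewards. The abstract does not spell out the formula, so the following is the standard formulation (with the usual notation assumed, not the paper's own):
$$V^{\pi}(s) \;=\; \mathbb{E}\Big[\, r_0 \;+\; \beta \sum_{k=1}^{\infty} \delta^{k} r_k \;\Big|\; s_0 = s,\ \pi \Big], \qquad \beta \in (0,1],\ \delta \in (0,1).$$
Setting $\beta = 1$ recovers ordinary exponential discounting, while $\beta < 1$ captures the present bias (preference for immediate gratification) referred to above, and is what makes the optimal policy depend on the starting time.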
Format: Short format (up to 4 pages + refs, appendix)
Publication Status: No
Submission Number: 82