RL Algorithms are Information-State Policies in the Bayes-Adaptive MDP

Alyssa Li Dayan; Michael D Dennis; Stuart Russell

23 Sept 2023 (modified: 11 Feb 2024), submitted to ICLR 2024

Primary Area: reinforcement learning

Keywords: Bayesian Reinforcement Learning, BAMDPs, Reinforcement Learning Theory, Lifelong Learning, Reward Shaping, Intrinsic Motivation

TL;DR: Viewing RL algorithms as hard-coded policies in BAMDPs allows a wide range of theoretical analyses, as well as insights into popular reward shaping techniques.

Abstract: RL studies the challenge of maximizing reward in unknown environments; the Bayes-Adaptive MDP (BAMDP) provides a formal specification of this problem, albeit one that may be intractable to solve directly. In this paper, rather than trying to solve the BAMDP, we use it as a theoretical resource. In particular, we view RL algorithms as *hand-written information-state policies* for the BAMDP and derive a number of insights from this approach. For instance, one simple observation from bandit theory is that optimal policies for the BAMDP, i.e., ideal RL algorithms, do not necessarily converge to optimal policies for the underlying MDP, even though RL theory has typically regarded the latter property as essential. We also apply the theory of potential-based reward shaping in the BAMDP to analyze valid forms of intrinsic motivation. We then show that BAMDP Q-values can be decomposed into separate measures of the value gained from exploration and exploitation. Finally, we derive a direct relationship between an RL algorithm's shaping function in the MDP and its suboptimality in the BAMDP, and use these results to clarify the roles of many forms of reward shaping.

Submission Number: 8172
