Keywords: stable basis training, deep neural policies
Abstract: Following the initial success of deep reinforcement learning in learning policies directly from interaction with complex, high-dimensional observations, and a decade of significant research, deep neural policies have been applied to a striking variety of fields ranging from pharmaceuticals to foundation models. Yet one of the strongest assumptions in reinforcement learning is that the MDP provides a reward signal. While this assumption holds naturally in certain domains, such as automated financial markets, it does not fit many others, where the computational complexity of providing such a signal for the task at hand exceeds the complexity of learning one. In this paper we focus on learning policies in MDPs without this assumption, and study sequential decision making without access to the reward information provided by the MDP. We introduce a training method for high-dimensional MDPs and provide a theoretically well-founded algorithm that significantly improves the sample complexity of deep neural policies. The theoretical and empirical analysis reported in our paper demonstrates that our method achieves substantial improvements in sample-efficient training while constructing more stable and resilient policies that generalize to uncertain environments.
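To make the reward-free setting described in the abstract concrete, the sketch below shows a generic interaction loop in which the agent never observes the reward returned by the environment. This is an illustrative assumption-laden example using the Gymnasium API, not the paper's algorithm; the random action choice is a placeholder for whatever policy-learning rule the method would apply to the collected transitions.

```python
# Illustration of reward-free interaction: only (state, action, next_state)
# transitions are recorded; the environment's reward signal is discarded.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, _ = env.reset(seed=0)

trajectory = []  # reward-free transitions
for _ in range(200):
    action = env.action_space.sample()  # placeholder for a learned policy
    next_obs, reward, terminated, truncated, _ = env.step(action)
    # The reward is deliberately ignored: the learner sees transitions only.
    trajectory.append((obs, action, next_obs))
    obs = next_obs
    if terminated or truncated:
        obs, _ = env.reset()

env.close()
print(f"Collected {len(trajectory)} reward-free transitions.")
```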
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 19628