Open Problem: Order Optimal Regret Bounds for Non-Markovian Rewards

Published: 23 Sept 2025, Last Modified: 01 Dec 2025
Venue: ARLET
License: CC BY 4.0
Track: Ideas, Open Problems and Positions Track
Keywords: Non-Markovian Rewards, Regret Bounds, Reinforcement Learning, Order-Optimal Performance Guarantees
Abstract: The standard RL world model is the Markov Decision Process (MDP), which assumes Markovian transitions and rewards: a basic premise of MDPs is that the reward depends only on the last state and action. Yet many real-world rewards are non-Markovian. Some problem settings involve "double-state" or non-Markovian reward functions, where the reward depends on the trajectory. Past work has considered modeling and solving MDPs with non-Markovian rewards (NMR), but we know of no principled approaches for RL with NMR, a gap that exacerbates the misalignment between theoretical researchers and practitioners. The setting is particularly interesting because it naturally extends the MDP structure. We therefore address the problem of learning policies from experience with such rewards. The open problem is to develop algorithms that solve these problems efficiently and come with provable regret bounds, even when the transition model is known. We highlight this open problem and discuss related challenges.
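As a brief sketch of the distinction the abstract draws, written in our own notation rather than taken from the submission: a Markovian reward depends only on the current state-action pair, a non-Markovian reward may depend on the whole history, and the regret the open problem asks to bound compares the learner's cumulative reward to that of the best history-dependent policy.

% Sketch in our own notation (an assumption for illustration, not from the submission).
\[
  r_t = r(s_t, a_t) \quad \text{(Markovian reward)}
\]
\[
  r_t = r(h_t), \qquad h_t = (s_0, a_0, s_1, a_1, \ldots, s_t, a_t) \quad \text{(non-Markovian reward)}
\]
\[
  \mathrm{Regret}(T)
  = \mathbb{E}_{\pi^\star}\!\Big[\textstyle\sum_{t=1}^{T} r(h_t)\Big]
  - \mathbb{E}_{\mathrm{alg}}\!\Big[\textstyle\sum_{t=1}^{T} r(h_t)\Big]
\]
% "Order-optimal" here means the upper bound matches the minimax lower bound up to
% constant or polylogarithmic factors, e.g. an O(\sqrt{T}) bound against an \Omega(\sqrt{T}) lower bound.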
Submission Number: 113