Keywords: Reinforcement Learning, Exogenous Markov Decision Processes, Regret Analysis, Linear Function Approximation, Exploration-free
Abstract: Exogenous Markov Decision Processes (Exo-MDPs) capture sequential decision-making with independent exogenous dynamics, arising in applications such as inventory control, energy storage, and resource management. Prior work in approximate dynamic programming shows that pure exploitation can be highly effective, with convergence guarantees in certain settings but no general regret bounds. In contrast, reinforcement learning approaches to Exo-MDPs almost exclusively rely on explicit exploration via optimism or hindsight optimization, leaving open whether exploitation alone can achieve provable guarantees. We resolve this question by proving the first near-optimal regret bounds for pure exploitation strategies under linear function approximation. Our key technical contribution is a novel analysis based on counterfactual trajectories and post-decision states, which yields regret bounds polynomial in the endogenous feature dimension, the size of the exogenous state space, and the horizon, and, crucially, independent of the endogenous state and action cardinalities. Experiments on synthetic and resource management benchmarks confirm that pure exploitation surpasses exploration-based methods.
Primary Area: reinforcement learning
Submission Number: 19186