Keywords: Reinforcement Learning, Exogenous Markov Decision Processes, Regret Analysis, Linear Function Approximation, Exploration-free
Abstract: Exogenous Markov Decision Processes (Exo-MDPs) capture sequential decision-making with independent exogenous dynamics, arising in applications such as inventory control, energy storage, and resource management. Prior work in approximate dynamic programming shows that pure exploitation can be highly effective, with convergence guarantees in certain settings but no general regret bounds. In contrast, reinforcement learning approaches to Exo-MDPs almost exclusively rely on explicit exploration via optimism or hindsight optimization, leaving open whether exploitation alone can achieve provable guarantees. We resolve this question by proving the first near-optimal regret bounds for pure exploitation strategies under linear function approximation. Our key technical contribution is a novel analysis based on counterfactual trajectories and post-decision states, which yields regret bounds polynomial in the endogenous feature dimension, the size of the exogenous state space, and the horizon, and, crucially, independent of the endogenous state and action cardinalities. Experiments on synthetic and resource management benchmarks confirm that pure exploitation surpasses exploration-based methods.
Primary Area: reinforcement learning
Submission Number: 19186