Keywords: decision making, reinforcement learning, online learning
Abstract: We study decision making with structured observations (DMSO). The complexity
of DMSO has been characterized by a series of works [FKQR21, CMB22, FGH23].
Still, there is a gap between known regret upper and lower bounds: current upper
bounds incur a model estimation error that scales with the size of the model class.
The work of [FGQ+23] made an initial attempt to reduce the estimation error so that it
scales only with the size of the value function class, resulting in a complexity measure called the
optimistic decision-estimation coefficient (optimistic DEC). Yet, their approach
relies on the optimism principle to drive exploration, which deviates from the
general DEC philosophy of driving exploration only through information gain.
In this work, we introduce an improved model-free DEC, called Dig-DEC, that
removes the optimism mechanism of [FGQ+23], making it more aligned with the
existing model-based DEC. Dig-DEC is always upper bounded by the optimistic DEC
and can be significantly smaller in special cases. Importantly, removing
optimism allows Dig-DEC to seamlessly handle adversarial environments, whereas it was
unclear how to achieve this within the optimistic DEC framework. By applying
Dig-DEC to hybrid MDPs, where the transitions are stochastic but the rewards are
adversarial, we provide the first model-free regret bounds for hybrid MDPs with
bandit feedback in several settings: bilinear classes and Bellman-complete MDPs
with bounded Bellman-eluder dimension or coverability. This resolves the main open
problem left by [LWZ25].
We also improve the online function-estimation procedures used in model-free learning.
For average estimation error minimization, we improve the estimator to achieve
better concentration; this improves the $T^{\frac{3}{4}}$ and $T^{\frac{5}{6}}$ regret of [FGQ+23] to $T^{\frac{2}{3}}$ and
$T^{\frac{7}{9}}$ under on-policy and off-policy exploration, respectively. For squared estimation
error minimization in Bellman-complete MDPs, we redesign the two-timescale
procedure of [AZ22, FGQ+23], achieving $\sqrt{T}$ regret, which improves over the $T^{\frac{2}{3}}$
regret of [FGQ+23]. This is the first time the performance of a DEC-based
approach for Bellman-complete MDPs matches that of optimism-based approaches
[JLM21, XFB+23].
Primary Area: learning theory
Submission Number: 16271