Keywords: decision making, reinforcement learning, online learning
Abstract: We study decision making with structured observations (DMSO). The complexity
of DMSO has been characterized by a series of works [FKQR21, CMB22, FGH23].
Still, there is a gap between known regret upper and lower bounds: current upper
bounds incur a model estimation error that scales with the size of the model class.
The work of [FGQ+23] made an initial attempt to reduce the estimation error so that it
scales only with the size of the value function class, resulting in a complexity measure called the
optimistic decision-estimation coefficient (optimistic DEC). Yet, their approach
relies on the optimism principle to drive exploration, which deviates from the
general DEC philosophy of driving exploration only through information gain.
In this work, we introduce an improved model-free DEC, called Dig-DEC, that
removes the optimism mechanism of [FGQ+23], making it more aligned with the
existing model-based DEC. Dig-DEC is always upper bounded by the optimistic DEC
and can be significantly smaller in special cases. Importantly, removing
optimism allows Dig-DEC to seamlessly handle adversarial environments, whereas it was
unclear how to achieve this within the optimistic DEC framework. By applying
Dig-DEC to hybrid MDPs, where the transitions are stochastic but the rewards are
adversarial, we provide the first model-free regret bounds for hybrid MDPs with
bandit feedback in several settings: bilinear classes and Bellman-complete MDPs
with bounded Bellman-eluder dimension or coverability. This resolves the main open
problem left by [LWZ25].
We also improve the online function-estimation procedures used in model-free learning.
For average estimation error minimization, we improve the estimator to achieve
better concentration; this improves the $T^{\frac{3}{4}}$ and $T^{\frac{5}{6}}$ regret of [FGQ+23] to $T^{\frac{2}{3}}$ and
$T^{\frac{7}{9}}$ under on-policy and off-policy exploration, respectively. For squared estimation
error minimization in Bellman-complete MDPs, we redesign the two-timescale
procedure of [AZ22, FGQ+23], achieving $\sqrt{T}$ regret, which improves over the $T^{\frac{2}{3}}$
regret of [FGQ+23]. This is the first time the performance of a DEC-based
approach for Bellman-complete MDPs matches that of optimism-based approaches
[JLM21, XFB+23].
Primary Area: learning theory
Submission Number: 16271