Keywords: Decision making, learning theory, bandits, reinforcement learning theory, online learning, decision-estimation coefficient
TL;DR: We combine the Decision-Estimation Coefficient (DEC) framework with a notion of "optimistic" estimation, allowing us to tackle model-free reinforcement learning in the DEC framework for the first time.
Abstract: We consider the problem of interactive decision making, encompassing structured bandits and reinforcement learning with general function approximation. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient, a measure of statistical complexity that lower bounds the optimal regret for interactive decision making, as well as a meta-algorithm, Estimation-to-Decisions, which achieves upper bounds in terms of the same quantity. Estimation-to-Decisions is a reduction that lifts algorithms for (supervised) online estimation into algorithms for decision making. In this paper, we show that by combining Estimation-to-Decisions with a specialized form of "optimistic" estimation introduced by Zhang (2022), it is possible to obtain guarantees that improve upon those of Foster et al. (2021) by accommodating more lenient notions of estimation error. We use this approach to derive regret bounds for model-free reinforcement learning with value function approximation, and give structural results showing when it can and cannot help more generally.
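For context, a sketch of the Decision-Estimation Coefficient as we recall it from Foster et al. (2021); the notation ($\mathcal{M}$ for the model class, $\widehat{M}$ for a reference model, $f^{M}$ for the value of a decision under $M$, $\pi_M$ for the optimal decision, and $D^2_{\mathrm{H}}$ for squared Hellinger distance) follows that paper and is not defined in the abstract above:
$$
\mathrm{dec}_{\gamma}(\mathcal{M}, \widehat{M}) \;=\; \inf_{p \in \Delta(\Pi)} \, \sup_{M \in \mathcal{M}} \, \mathbb{E}_{\pi \sim p}\Big[\, f^{M}(\pi_M) - f^{M}(\pi) \;-\; \gamma \cdot D^2_{\mathrm{H}}\big(M(\pi), \widehat{M}(\pi)\big) \,\Big].
$$
Roughly, it trades off regret against information gained relative to the reference model; the present paper's contribution is to pair this framework with "optimistic" estimation so that weaker (more lenient) estimation-error guarantees suffice.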
Supplementary Material: pdf
Submission Number: 13372