Model-Free Reinforcement Learning with the Decision-Estimation Coefficient

Published: 21 Sept 2023, Last Modified: 02 Nov 2023NeurIPS 2023 posterEveryoneRevisionsBibTeX
Keywords: Decision making, learning theory, bandits, reinforcement learning theory, online learning, decision-estimation coefficient
TL;DR: We combine the Decision-Estimation Coefficient (DEC) framework with a notion of "optimistic" estimation, allowing us to tackle model-free reinforcement learning in the DEC framework for the first time.
Abstract: We consider the problem of interactive decision making, encompassing structured bandits and reinforcement learning with general function approximation. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient, a measure of statistical complexity that lower bounds the optimal regret for interactive decision making, as well as a meta-algorithm, Estimation-to-Decisions, which achieves upper bounds in terms of the same quantity. Estimation-to-Decisions is a reduction, which lifts algorithms for (supervised) online estimation into algorithms for decision making. In this paper, we show that by combining Estimation-to-Decisions with a specialized form of "optimistic" estimation introduced by Zhang (2022), it is possible to obtain guarantees that improve upon those of Foster et al. (2021) by accommodating more lenient notions of estimation error. We use this approach to derive regret bounds for model-free reinforcement learning with value function approximation, and give structural results showing when it can and cannot help more generally.
Supplementary Material: pdf
Submission Number: 13372