Keywords: sequential decision-making, decision-estimation coefficient, regret minimization, bandits, reinforcement learning, partial monitoring
TL;DR: We introduce and analyze an anytime variant of the E2D algorithm by proposing the average-constrained decision-estimation coefficient.
Abstract: A long line of work characterizes the sample complexity of regret minimization in sequential decision-making via min-max programs.
In the corresponding saddle-point game, the min-player optimizes the sampling distribution against an adversarial max-player that chooses confusing models leading to large regret. The most recent instantiation of this idea is the decision-estimation coefficient (DEC), which was shown to provide nearly tight lower and upper bounds on the worst-case expected regret in structured bandits and reinforcement learning. By re-parametrizing the offset DEC with the confidence radius and solving the corresponding min-max program, we derive an anytime variant of the Estimation-To-Decisions algorithm (Anytime-E2D). Importantly, the algorithm optimizes the exploration-exploitation trade-off online, rather than fixing it a priori in the analysis. Our formulation leads to a practical algorithm for finite model classes and linear feedback models. We further point out connections to the information ratio, the decoupling coefficient, and the PAC-DEC, and numerically evaluate the performance of E2D on simple examples.
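For context, a common form of the offset DEC that the abstract refers to (following the decision-making-with-structured-observations framework; the notation below is an assumption, not taken verbatim from this paper) is

\[
\mathsf{dec}_{\gamma}(\mathcal{M}, \widehat{M}) \;=\; \min_{p \in \Delta(\Pi)} \; \max_{M \in \mathcal{M}} \; \mathbb{E}_{\pi \sim p}\Big[\, f^{M}(\pi^{\star}_{M}) - f^{M}(\pi) \;-\; \gamma\, D^{2}_{\mathrm{H}}\big(M(\pi),\, \widehat{M}(\pi)\big) \,\Big],
\]

where \(\Pi\) is the decision set, \(\Delta(\Pi)\) the set of sampling distributions, \(f^{M}(\pi)\) the mean reward of decision \(\pi\) under model \(M\) with optimizer \(\pi^{\star}_{M}\), \(\widehat{M}\) the current model estimate, and \(D^{2}_{\mathrm{H}}\) the squared Hellinger distance. The outer minimization is the min-player's sampling distribution and the inner maximization the adversarial choice of a confusing model, matching the saddle-point game described above.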
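Since the abstract mentions a practical algorithm for finite model classes, here is a minimal sketch of how the min-max program above can be solved in that setting: with finitely many decisions and models, the objective is linear in the sampling distribution, so the saddle point reduces to a linear program. The function name offset_dec and its array-based interface are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def offset_dec(regret, hellinger_sq, gamma):
    """Solve the offset-DEC min-max program for finite decision/model sets.

    regret       -- (n_models, n_decisions) array; regret[M, pi] is the
                    regret of decision pi under model M.
    hellinger_sq -- (n_models, n_decisions) array; squared Hellinger
                    distance between M(pi) and the estimate Mhat(pi).
    gamma        -- offset parameter trading regret against information gain.

    Returns (value, p): the DEC value and the minimizing sampling
    distribution p over decisions.
    """
    # Max-player payoff: linear in p for each fixed model M.
    loss = regret - gamma * hellinger_sq
    n_models, n_dec = loss.shape
    # Variables [p_1, ..., p_n, t]; minimize t s.t. loss @ p <= t, p in simplex.
    c = np.zeros(n_dec + 1)
    c[-1] = 1.0
    A_ub = np.hstack([loss, -np.ones((n_models, 1))])
    b_ub = np.zeros(n_models)
    A_eq = np.hstack([np.ones((1, n_dec)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_dec + [(None, None)]  # p >= 0, t free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.fun, res.x[:n_dec]

# Toy usage on a random instance with 3 models and 4 decisions.
rng = np.random.default_rng(0)
value, p = offset_dec(rng.uniform(size=(3, 4)), rng.uniform(size=(3, 4)), gamma=2.0)
print(value, p)
```

An anytime scheme in the spirit of the abstract would re-solve this program each round with an updated model estimate and confidence radius, rather than committing to a single offset up front.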
Submission Number: 13376