Track: Theory
Keywords: max-following, ensembling, reinforcement learning, machine learning theory
TL;DR: We extend max-following, a method for improving upon a base class of policies in large state spaces, to settings with adversarial start states.
Abstract: Learning the optimal policy in reinforcement learning (RL) with large state and
action spaces remains a notoriously difficult problem from both computational and
statistical perspectives. A recent line of work addresses this challenge by aiming
to compete with, or improve upon, a given base class of policies. One approach,
known as max-following, selects at each state the policy from the base class whose
estimated value function is highest. In this paper, we extend the max-following
framework to the setting of regret minimization under adversarial initial states and
limited feedback. Our algorithm is oracle-efficient, achieves no-regret guarantees
with respect to the base class (and to the worst approximate max-following policy),
and avoids any dependence on the size of the state or action space. It also attains
the optimal rate in terms of the number of episodes. Additionally, we establish a
lower bound on the regret of any max-following algorithm as a function of β, a
parameter that quantifies the approximation slack in the benchmark policy class.
Finally, we empirically validate our theoretical findings on the Linear Quadratic
Regulator (LQR) problem.
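
The abstract's description of max-following (at each state, act according to the base policy with the highest estimated value) admits a short illustration. The sketch below is a hypothetical rendering of that selection rule only, not the paper's oracle-efficient algorithm; the `Policy` and `ValueEstimate` interfaces are assumptions introduced for illustration.

```python
from typing import Callable, Sequence

# Hypothetical interfaces: a policy maps a state to an action,
# and a value estimate maps a state to an estimated return.
Policy = Callable[[object], object]
ValueEstimate = Callable[[object], float]


def max_following_action(state,
                         base_policies: Sequence[Policy],
                         value_estimates: Sequence[ValueEstimate]):
    """Select the action of the base policy whose estimated value is highest at `state`.

    This is only the per-state selection rule described in the abstract;
    regret minimization under adversarial initial states requires more.
    """
    best_idx = max(range(len(base_policies)),
                   key=lambda i: value_estimates[i](state))
    return base_policies[best_idx](state)
```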
Serve As Reviewer: ~Sikata_Bela_Sengupta1, ~Teodor_Vanislavov_Marinov2
Submission Number: 61