Oracle-Efficient Adversarial Reinforcement Learning via Max-Following

Published: 12 Jun 2025, Last Modified: 28 Jun 2025 · EXAIT@ICML 2025 Poster · CC BY 4.0
Track: Theory
Keywords: max-following, ensembling, reinforcement learning, machine learning theory
TL;DR: We extend max-following, which improves upon a base class of policies in large state spaces, to settings with adversarial start states.
Abstract: Learning the optimal policy in reinforcement learning (RL) with large state and action spaces remains a notoriously difficult problem from both computational and statistical perspectives. A recent line of work addresses this challenge by aiming to compete with, or improve upon, a given base class of policies. One approach, known as max-following, selects at each state the policy from the base class whose estimated value function is highest. In this paper, we extend the max-following framework to the setting of regret minimization under adversarial initial states and limited feedback. Our algorithm is oracle-efficient, achieves no-regret guarantees with respect to the base class (and to the worst approximate max-following policy), and avoids any dependence on the size of the state or action space. It also attains the optimal rate in terms of the number of episodes. Additionally, we establish a lower bound on the regret of any max-following algorithm as a function of β, a parameter that quantifies the approximation slack in the benchmark policy class. Finally, we empirically validate our theoretical findings on the Linear Quadratic Regulator (LQR) problem.
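To make the max-following rule described in the abstract concrete, here is a minimal illustrative sketch, not the paper's algorithm or implementation; the names `max_following_action`, `base_policies`, and `value_estimates` are assumptions introduced only for this example.

```python
import numpy as np

def max_following_action(state, base_policies, value_estimates):
    """Select an action via the max-following rule: at the current state,
    follow the base policy whose estimated value function is highest.

    base_policies: list of callables mapping state -> action.
    value_estimates: list of callables mapping state -> estimated value of
        running the corresponding base policy from that state.
    """
    # Estimate each base policy's value at the current state.
    values = [V(state) for V in value_estimates]
    # Act according to the policy with the highest estimated value.
    best = int(np.argmax(values))
    return base_policies[best](state)


if __name__ == "__main__":
    # Toy usage: two constant policies on a 1-D state (hypothetical values).
    pi0, pi1 = (lambda s: 0), (lambda s: 1)
    V0 = lambda s: -abs(s)          # assumed value estimate for pi0
    V1 = lambda s: -abs(s - 1.0)    # assumed value estimate for pi1
    print(max_following_action(0.9, [pi0, pi1], [V0, V1]))  # selects pi1's action
```

In the paper's setting the value estimates come from an oracle rather than being given exactly, which is what the approximation slack parameter β accounts for.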
Serve As Reviewer: ~Sikata_Bela_Sengupta1, ~Teodor_Vanislavov_Marinov2
Submission Number: 61