TL;DR: First computationally efficient algorithm for infinite-horizon average-reward linear MDPs.
Abstract: We study reinforcement learning in infinite-horizon average-reward settings with linear MDPs. Previous work addresses this problem by approximating the average-reward setting with a discounted setting and employing a value iteration-based algorithm that uses clipping to constrain the span of the value function for improved statistical efficiency. However, the clipping procedure requires computing the minimum of the value function over the entire state space, which is prohibitive since the state space in the linear MDP setting can be large or even infinite. In this paper, we introduce a value iteration method with an efficient clipping operation that only requires computing the minimum of value functions over the set of states visited by the algorithm. Our algorithm enjoys the same regret bound as the previous work while being computationally efficient, with computational complexity independent of the size of the state space.
Lay Summary: We study reinforcement learning in the infinite-horizon setting, where the agent interacts with the environment indefinitely, and the objective is to maximize the long-term average reward. We focus on the linear MDP setting, where the state space can be extremely large but is equipped with a low-dimensional feature representation. Prior work approximates the average-reward objective using a discounted formulation, enabling the use of value iteration-based algorithms. However, these methods suffer from computational inefficiency, as they require computing the minimum of a value function over the entire state space, a task that becomes intractable in large or infinite spaces. In this paper, we propose new algorithmic techniques that overcome this limitation, resulting in a computationally efficient algorithm that retains the same performance guarantees as existing approaches.
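To make the clipping idea concrete, here is a minimal sketch of the efficient variant described above: instead of minimizing the value function over the whole (possibly infinite) state space, the clipping threshold is computed only from value estimates at states the algorithm has actually visited. All names (`clip_over_visited`, `span_bound`) are illustrative, not the paper's notation.

```python
import numpy as np

def clip_over_visited(v_visited, span_bound):
    """Clip value estimates so their span stays within span_bound.

    v_visited: value estimates at the states visited so far
               (a finite set, regardless of the full state space size).
    span_bound: an upper bound on the allowed span of the value function.

    Each value is truncated at (min over visited states) + span_bound,
    so the span of the clipped values is at most span_bound.
    """
    v_visited = np.asarray(v_visited, dtype=float)
    v_min = v_visited.min()  # minimum over visited states only
    return np.minimum(v_visited, v_min + span_bound)

# Example: values with span 10 get truncated to span 5.
clipped = clip_over_visited([0.0, 3.0, 10.0], span_bound=5.0)
# clipped is [0.0, 3.0, 5.0]; span is now 5.0 <= span_bound
```

The key computational point is that the cost of this operation scales with the number of visited states, not with the size of the state space.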
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: reinforcement learning theory, infinite-horizon average-reward RL, linear MDP
Submission Number: 1436