Rank-One Modified Value Iteration

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY-ND 4.0
Abstract: In this paper, we propose a novel algorithm for solving planning and learning problems in Markov decision processes. The proposed algorithm follows a policy iteration-type update, using a rank-one approximation of the transition probability matrix in the policy evaluation step. This rank-one approximation is closely related to the stationary distribution of the corresponding transition probability matrix, which is approximated using the power method. We provide theoretical guarantees that the proposed algorithm converges to the optimal (action-)value function at the same rate and with the same computational complexity as the value iteration algorithm in the planning problem and as the Q-learning algorithm in the learning problem. Through extensive numerical simulations, however, we show that the proposed algorithm consistently outperforms first-order algorithms and their accelerated versions on both planning and learning problems.
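To make the rank-one idea concrete, here is a rough sketch in assumed notation (not taken verbatim from the paper): if the transition matrix P^π induced by a policy π is replaced by the rank-one matrix built from its stationary distribution μ_π, the policy evaluation equation admits a closed form via the Sherman–Morrison identity.

```latex
% Sketch under an assumed rank-one surrogate; notation is illustrative, not the paper's exact update.
% With P^{\pi} \approx \mathbf{1}\mu_{\pi}^{\top} and \mu_{\pi}^{\top}\mathbf{1} = 1,
% the evaluation equation V^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi} simplifies as follows.
\[
  P^{\pi} \approx \mathbf{1}\,\mu_{\pi}^{\top}
  \quad\Longrightarrow\quad
  V^{\pi} \approx \bigl(I - \gamma\,\mathbf{1}\mu_{\pi}^{\top}\bigr)^{-1} r^{\pi}
  = r^{\pi} + \frac{\gamma\,\bigl(\mu_{\pi}^{\top} r^{\pi}\bigr)}{1-\gamma}\,\mathbf{1}.
\]
```

Under this surrogate the evaluation step reduces to a single inner product, which is one way a rank-one approximation can cheapen policy evaluation; per the abstract, the paper's actual scheme pairs such an approximation with a power-method estimate of μ_π and a convergence analysis.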
Lay Summary: This paper introduces a new algorithm for solving Markov Decision Processes (MDPs), which are widely used models for decision-making in uncertain environments, with applications in robotics, operations research, and reinforcement learning. The algorithm is based on a policy iteration framework but departs from classical methods in how it performs policy evaluation. Specifically, it uses a rank-one approximation of the transition probability matrix during the policy evaluation step. This approximation captures the dominant behavior of the system and is closely tied to the stationary distribution of the policy-induced Markov chain. To compute this efficiently, the algorithm leverages the power method, a classical iterative technique for estimating dominant eigenvectors. Theoretically, we show that this algorithm converges to the optimal value or action-value function, with rates and computational complexity matching standard algorithms such as value iteration (for planning) and Q-learning (for model-free learning). Empirically, however, the proposed method shows clear advantages in a range of simulations; it consistently outperforms both first-order methods and their accelerated variants. This suggests that the rank-one approximation not only simplifies the policy evaluation step but also enhances performance in practice, making it a promising alternative for both planning and learning tasks in MDPs.
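As a complementary illustration, below is a minimal, hypothetical Python sketch (not the authors' code) of how a rank-one evaluation step based on a power-method estimate of the stationary distribution could sit inside a policy iteration-type loop; all function names, shapes, and the closed-form update are assumptions made for illustration.

```python
import numpy as np

def stationary_dist(P_pi, iters=50):
    """Power-method estimate of the stationary distribution (left dominant
    eigenvector) of a row-stochastic matrix P_pi; assumes an ergodic chain."""
    n = P_pi.shape[0]
    mu = np.full(n, 1.0 / n)
    for _ in range(iters):
        mu = mu @ P_pi        # mu_{k+1} = mu_k P_pi
        mu /= mu.sum()        # keep it a probability vector
    return mu

def rank_one_modified_iteration(P, r, gamma, iters=200, tol=1e-8):
    """Hypothetical sketch of a policy iteration-type scheme whose evaluation
    step uses the rank-one surrogate P_pi ~ 1 mu_pi^T (not the paper's method).

    P : (A, S, S) transition tensor, r : (S, A) reward matrix, 0 < gamma < 1.
    """
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(iters):
        # Policy improvement: greedy policy w.r.t. the current value estimate.
        Q = r + gamma * np.einsum('aij,j->ia', P, V)
        pi = Q.argmax(axis=1)
        P_pi = P[pi, np.arange(S)]     # (S, S) transition matrix of the induced chain
        r_pi = r[np.arange(S), pi]     # (S,) reward under pi

        # Rank-one policy evaluation: with P_pi replaced by 1 mu_pi^T,
        # (I - gamma P_pi)^{-1} r_pi has a Sherman-Morrison closed form.
        mu = stationary_dist(P_pi)
        V_new = r_pi + gamma / (1.0 - gamma) * (mu @ r_pi)

        if np.max(np.abs(V_new - V)) < tol:
            return V_new, pi
        V = V_new
    return V, pi
```

A quick way to try the sketch is on a random MDP, e.g. `P = np.random.dirichlet(np.ones(S), size=(A, S))` for a valid (A, S, S) transition tensor and `r = np.random.rand(S, A)` for rewards; unlike the paper's algorithm, this toy version carries no convergence guarantee and is meant only to show where the stationary distribution and power method enter the evaluation step.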
Primary Area: Reinforcement Learning
Keywords: Markov decision process, dynamic programming, reinforcement learning, value iteration, Q-learning
Submission Number: 4991