SwiftVI: Time-Efficient Planning and Learning with MDPs

Published: 11 Feb 2025, Last Modified: 13 May 2025, MLSys 2025, CC BY-NC-SA 4.0
Keywords: value iteration, Markov Decision Processes
TL;DR: We provide value iteration algorithms for planning and learning with MDPs that scale in the number of states and actions.
Abstract: Markov decision processes (MDPs) find application wherever a decision-making agent acts and learns in an uncertain environment, from facility management to healthcare and service provisioning. However, the tasks of learning model parameters and planning the optimal policy such an agent should follow incur high computational cost, calling for solutions that scale to large numbers of actions and states. In this paper, we propose SwiftVI, a suite of algorithms that plan and learn with MDPs scalably by organizing the set of actions for each state in priority queues and deriving bounds for backup Q-values. Our championed solution prunes the set of actions at each state utilizing a tight upper bound and a single priority queue. A thorough experimental study confirms that SwiftVI algorithms achieve high efficiency gains robustly across model parameters.
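The abstract's core idea, keeping actions in a priority queue ordered by upper bounds on their backup Q-values and pruning actions whose bound cannot beat the current best, can be sketched as follows. This is an illustrative toy, not the paper's implementation; all names (`Q_upper`, `exact_q`, `backup_with_pruning`) are assumptions for the sketch.

```python
import heapq

def backup_with_pruning(actions, Q_upper, exact_q):
    """Return max_a Q(s, a) for one state while evaluating as few
    actions as possible.

    Q_upper[a]  -- a precomputed upper bound on Q(s, a) (assumed valid)
    exact_q(a)  -- computes the exact (expensive) backup Q-value for a
    """
    # Single max-heap keyed on the upper bound (negated for heapq's min-heap).
    heap = [(-Q_upper[a], a) for a in actions]
    heapq.heapify(heap)
    best = float("-inf")
    while heap:
        neg_ub, a = heapq.heappop(heap)
        if -neg_ub <= best:
            # No remaining action's upper bound exceeds the best exact
            # Q-value found so far, so the rest can be pruned.
            break
        best = max(best, exact_q(a))
    return best
```

Because actions are popped in decreasing order of their upper bounds, the loop typically computes only a handful of exact backups per state, which is the source of the efficiency gains the paper reports.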
Submission Number: 36
